Reposted from http://blog.itpub.net/29065182/viewspace-1070553/


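ASM disk groups support three redundancy types: external, normal, and high. The disk group recovered here uses normal redundancy. The experiment simulates the OCR disks or the voting disks becoming unavailable: what symptoms does the RAC show, and how is the fault located step by step? In 11.2.0.3 the voting files are stored in the same ASM disk group as the OCR, so the two experiments (OCR unavailable and votedisk unavailable) are done together. In 11.2.0.3 the OCR can also be exported manually, but such an export is not listed by ocrconfig -showbackup, and the recovery here uses an automatic backup.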

ocrconfig -export /u01/ocr.exp

Check which OCR backups exist:

[root@rac1 ~]# ocrconfig -showbackup
rac1     2013/07/22 05:39:51     /u01/grid/crs/cdata/rac/backup00.ocr
rac1     2013/07/22 01:39:51     /u01/grid/crs/cdata/rac/backup01.ocr
rac1     2013/07/21 21:39:50     /u01/grid/crs/cdata/rac/backup02.ocr
rac2     2013/07/21 01:52:54     /u01/grid/crs/cdata/rac/day.ocr
rac2     2013/07/09 01:52:25     /u01/grid/crs/cdata/rac/week.ocr
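PROT-25: Manual backups for the Oracle Cluster Registry are not available

Note: PROT-25 means that no manual backup is on record. The file written by ocrconfig -export is a logical export rather than a manual backup, so it is not listed here.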
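
The restore later in this walkthrough uses backup00.ocr, the newest automatic backup in the listing. As a minimal sketch, the newest entry could be picked out of the listing like this. The function name newest_ocr_backup is hypothetical, and the sketch assumes every backup line has the form "node date time path" with zero-padded dates, as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the most recent backup found in
# `ocrconfig -showbackup` output read from stdin.
# Assumption: backup lines look like "<node> <YYYY/MM/DD> <HH:MM:SS> <path>".
newest_ocr_backup() {
    awk 'NF >= 4 && $2 ~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
             stamp = $2 " " $3      # zero-padded, so string order equals time order
             if (stamp > best) { best = stamp; path = $4 }
         }
         END { if (path != "") print path }'
}

# Intended use on a cluster node (not executed here):
#   ocrconfig -showbackup | newest_ocr_backup
```

Lines such as the PROT-25 message are skipped because their second field is not a date.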

Check the voting disk information:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   745716af7e5b4faebfc8d948d096aa55 (/dev/oracleasm/disks/OCR_VOT1) [OCR_VOT]
 2. ONLINE   7092079f66c04f9dbf65974d0dcc611a (/dev/oracleasm/disks/OCR_VOT2) [OCR_VOT]
 3. ONLINE   6510631353284f5fbf3d4c8839822dbd (/dev/oracleasm/disks/OCR_VOT3) [OCR_VOT]
Located 3 voting disk(s).
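
A quick way to read this output in a script is to count the files reported ONLINE. The sketch below does that and adds a majority check, on the assumption (general Clusterware background, not stated in this post) that a node must be able to access more than half of the voting files. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count voting files reported ONLINE in
# `crsctl query css votedisk` output read from stdin.
count_online_votedisks() {
    awk '$1 ~ /^[0-9]+\.$/ && $2 == "ONLINE" { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed only if strictly more than half of the
# expected voting files are online (assumed majority rule).
votedisk_majority_ok() {
    local online=$1 total=$2
    [ "$total" -gt 0 ] && [ $((online * 2)) -gt "$total" ]
}

# Intended use on a cluster node (not executed here):
#   n=$(crsctl query css votedisk | count_online_votedisks)
#   votedisk_majority_ok "$n" 3 || echo "voting file majority lost"
```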
 
# Stop the database:
[root@rac1 ~]# srvctl stop database -d orcl -o immediate
# Stop the cluster:
[root@rac1 ~]# crsctl stop cluster -all -f
 
# Corrupt the OCR and voting disks:
[root@rac1 ~]# dd if=/dev/zero of=/dev/mapper/mpathap1 bs=1024K count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0160613 seconds, 65.3 MB/s
[root@rac1 ~]# dd if=/dev/zero of=/dev/mapper/mpathap2 bs=1024K count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00800275 seconds, 131 MB/s
[root@rac1 ~]# dd if=/dev/zero of=/dev/mapper/mpathap3 bs=1024K count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00927389 seconds, 113 MB/s

Note: after the corruption, the services on every node still look normal:

[root@rac1 ~]# crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    rac1    
ora.FRA.dg     ora....up.type ONLINE    ONLINE    rac1    
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1    
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac1    
ora.OCR_VOT.dg ora....up.type ONLINE    ONLINE    rac1    
ora.asm        ora.asm.type   ONLINE    ONLINE    rac1    
ora.orcl.db    ora....se.type ONLINE    ONLINE    rac1    
ora.cvu        ora.cvu.type   ONLINE    ONLINE    rac1    
ora....SM1.asm application    ONLINE    ONLINE    rac1    
ora....C1.lsnr application    ONLINE    ONLINE    rac1    
ora....ac1.gsd application    OFFLINE   OFFLINE              
ora....ac1.ons application    ONLINE    ONLINE    rac1    
ora....ac1.vip ora....t1.type ONLINE    ONLINE    rac1    
ora....SM2.asm application    ONLINE    ONLINE    rac2    
ora....C2.lsnr application    ONLINE    ONLINE    rac2    
ora....ac2.gsd application    OFFLINE   OFFLINE              
ora....ac2.ons application    ONLINE    ONLINE    rac2    
ora....ac2.vip ora....t1.type ONLINE    ONLINE    rac2    
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE              
ora....network ora....rk.type ONLINE    ONLINE    rac1    
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    rac1    
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1    
ora....ry.acfs ora....fs.type ONLINE    ONLINE    rac1    
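ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac1

After the operating system is rebooted on all nodes, the cluster services no longer start:


[root@rac1 ~]# reboot

If you only stop the cluster services, recreating the ASM disk group later fails; after an operating system reboot it can be created successfully.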

Check CRS:


[grid@rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

Start the cluster services:


[root@rac1 ~]# crsctl start cluster -all 
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac2'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2672: Attempting to start 'ora.cssd' on 'rac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac2'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.diskmon' on 'rac2' succeeded
# The command hangs here indefinitely.

In another terminal, try a different command to start the cluster services:


[root@rac1 ~]# crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Nothing particularly useful shows up in the operating system log or the CRS log:


[root@rac1 ~]# vi /var/log/messages
[grid@rac1 ~]$ vi $ORACLE_HOME/log/rac1/crsd/crsd.log

The ocssd log reports:


vi $ORACLE_HOME/log/rac1/cssd/ocssd.log
2013-07-21 21:15:08.550: [    CSSD][1095031104]clssnmvFindInitialConfigs: No voting files found

Some of the ASM disks have disappeared:


[root@rac1 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac1 ~]# /etc/init.d/oracleasm listdisks
DATA
FRA

Recreate the ASM disks following the RAC installation documentation:


[root@rac1 ~]# /etc/init.d/oracleasm createdisk OCR_VOT1 /dev/mapper/mpathap1
Marking disk "OCR_VOT1" as an ASM disk:                    [  OK  ]
[root@rac1 ~]# /etc/init.d/oracleasm createdisk OCR_VOT2 /dev/mapper/mpathap2
Marking disk "OCR_VOT2" as an ASM disk:                    [  OK  ]
[root@rac1 ~]# /etc/init.d/oracleasm createdisk OCR_VOT3 /dev/mapper/mpathap3
Marking disk "OCR_VOT3" as an ASM disk:                    [  OK  ]

Stop the cluster services. Add -f, otherwise the stop may be very slow:


[root@rac1 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.

Start the cluster with -excl -nocrs; this starts the ASM instance but does not start CRS:


[root@rac1 ~]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac1'
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2676: Start of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded

At this point CRS still reports errors:

[root@rac1 ~]# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
 
[root@rac1 ~]# crsctl check crs             
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

Recreate the disk group that originally held the OCR and the voting disks.
Note: this is done as the grid user.


SQL> col path for a50
SQL> set lines 300
SQL> select path,header_status from v$asm_disk;
SQL> create diskgroup OCR_VOT normal redundancy disk '/dev/oracleasm/disks/OCR_VOT1','/dev/oracleasm/disks/OCR_VOT2','/dev/oracleasm/disks/OCR_VOT3'
attribute 'compatible.rdbms' = '11.2','compatible.asm' = '11.2';

ASM disk groups have three redundancy types: external, normal, and high; the original disk group here used normal.
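
Because a line break inside a quoted disk path breaks the statement, it can help to generate the SQL rather than type it. The following is only a sketch: build_create_diskgroup is a hypothetical helper, and it hard-codes the two compatibility attributes used in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a CREATE DISKGROUP statement on stdout.
# Usage: build_create_diskgroup <name> <external|normal|high> <disk>...
# The compatibility attributes are fixed to the values used in this post.
build_create_diskgroup() {
    local name=$1 redundancy=$2
    shift 2
    case "$redundancy" in
        external|normal|high) ;;
        *) echo "unknown redundancy: $redundancy" >&2; return 1 ;;
    esac
    local disks="" d
    for d in "$@"; do
        disks="${disks:+$disks,}'$d'"   # quote each path, comma-separate
    done
    printf "create diskgroup %s %s redundancy disk %s\n" "$name" "$redundancy" "$disks"
    printf "attribute 'compatible.rdbms' = '11.2','compatible.asm' = '11.2';\n"
}

# Example (prints the statement only; it does not connect to ASM):
#   build_create_diskgroup OCR_VOT normal /dev/oracleasm/disks/OCR_VOT{1,2,3}
```

The output can be reviewed and then pasted into a SQL*Plus session on the ASM instance.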

Restore the OCR from an OCR backup.
On each node, as the grid user:


cd $ORACLE_HOME/cdata/rac
ocrconfig -restore /u01/grid/crs/cdata/rac/backup00.ocr

Preparation for restoring the voting disks:


show parameter asm_diskstring

If asm_diskstring has no value, ASM is using its default disk discovery path.
Change it to the actual ASM disk discovery path:

alter system set asm_diskstring='/dev/oracleasm/disks/*';

Restore the voting disks:


[root@rac1 ~]# crsctl replace votedisk  +OCR_VOT
Successful addition of voting disk 4ad2b9cc0a754fffbf1515281199a78f.
Successful addition of voting disk 9f8dc1c013df4f39bfd85c64051a0bc1.
Successful addition of voting disk a4aea7a1aa434fb3bff161f6ea8ce102.
Successfully replaced voting disk group with +OCR_VOT.
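CRS-4266: Voting file(s) successfully replaced

With the OCR and voting disks restored, CSS is online again, and CRS and the other services come up once the stack is restarted: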


[root@rac1 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
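
The same check appears several times in this procedure, so a small filter that shows only the components that are not online can make the differences easier to spot. This is a sketch under the assumption that healthy components always end in "is online", as in the output above. The function name crs_stack_problems is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `crsctl check crs` output from stdin and print
# every CRS-nnnn line that does not end in "is online".
# Exit status is 0 when no such line was seen, 1 otherwise.
crs_stack_problems() {
    awk '/^CRS-[0-9]+:/ && $0 !~ /is online[[:space:]]*$/ { print; bad = 1 }
         END { exit bad }'
}

# Intended use on a cluster node (not executed here):
#   crsctl check crs | crs_stack_problems || echo "stack is not fully up"
```

For the output directly above, the filter would print the CRS-4535 and CRS-4534 lines, which matches the state of a stack started with -excl -nocrs.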
 
[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   4ad2b9cc0a754fffbf1515281199a78f (/dev/oracleasm/disks/OCR_VOT1) [OCR_VOT]
 2. ONLINE   9f8dc1c013df4f39bfd85c64051a0bc1 (/dev/oracleasm/disks/OCR_VOT2) [OCR_VOT]
 3. ONLINE   a4aea7a1aa434fb3bff161f6ea8ce102 (/dev/oracleasm/disks/OCR_VOT3) [OCR_VOT]
Located 3 voting disk(s).

Restart the cluster services and check whether everything is back to normal:


[root@rac1 ~]# crsctl stop crs
[root@rac1 ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

The other node can now be started as well:


[root@rac2 ~]# crsctl start crs

After a while, check that the services are up:


[root@rac1 ~]# crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    rac1    
ora.FRA.dg     ora....up.type ONLINE    ONLINE    rac1    
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1    
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac1    
ora.OCR_VOT.dg ora....up.type ONLINE    ONLINE    rac1    
ora.asm        ora.asm.type   ONLINE    ONLINE    rac1    
ora.orcl.db    ora....se.type ONLINE    ONLINE    rac1    
ora.cvu        ora.cvu.type   ONLINE    ONLINE    rac1    
ora....SM1.asm application    ONLINE    ONLINE    rac1    
ora....C1.lsnr application    ONLINE    ONLINE    rac1    
ora....ac1.gsd application    OFFLINE   OFFLINE              
ora....ac1.ons application    ONLINE    ONLINE    rac1    
ora....ac1.vip ora....t1.type ONLINE    ONLINE    rac1    
ora....SM2.asm application    ONLINE    ONLINE    rac2    
ora....C2.lsnr application    ONLINE    ONLINE    rac2    
ora....ac2.gsd application    OFFLINE   OFFLINE              
ora....ac2.ons application    ONLINE    ONLINE    rac2    
ora....ac2.vip ora....t1.type ONLINE    ONLINE    rac2    
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE              
ora....network ora....rk.type ONLINE    ONLINE    rac1    
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    rac1    
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1    
ora....ry.acfs ora....fs.type ONLINE    ONLINE    rac1    
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac1

At this point, the OCR and