一、故障概述
因为数据库服务器重启出现OCR盘丢失导致oracle数据库集群宕机。
二、故障重现
在测试环境破坏保存OCR信息的磁盘组+CVD
- 使用dd清除OCR的磁盘头信息(因为OCR是normal,所以掉了两个才无法启动)
[root@rac1 ~]# dd if=/dev/zero of=/dev/oracleasm/disks/CVD01 bs=1024 count=1000
[root@rac1 ~]# dd if=/dev/zero of=/dev/oracleasm/disks/CVD02 bs=1024 count=1000
- 打开集群的alert日志并打开集群观察日志
[grid@rac1 ~]$ tail -200f /u01/app/11.2/grid/log/rac1/alertrac1.log
[root@rac1 grid]# crsctl start crs
然后发现日志出现以下报错信息:
说明集群找不到voting盘文件,ocr记录的是集群配置信息
使用ocrcheck检测报错也无法显示信息:
[root@rac1 ~]# ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
三、故障恢复
- 强制停止crs
[root@rac1 ~]# crsctl stop crs –f
- 检查确认集群停止成功
[root@rac1 ~]# ps -ef|grep crs
[root@rac1 ~]# ps -ef|grep css
[root@rac1 ~]# ps -ef|grep ohas
[root@rac1 ~]# ps -ef|grep evm
- 以独占模式启动crs
[root@rac1 grid]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac1'
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2676: Start of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
说明:
-excl 该参数指定使用独占模式
-nocrs 该参数指定忽略查找crs及voting
- 创建CVD磁盘组以及spfile
[grid@rac1 ~]$ asmcmd
ASMCMD> ls
空的,因为spfile文件存在CVD磁盘组当中,spfile也因此丢失,读不到DATA和ARCH磁盘组,需要重建spfile参数文件。
1)创建一个pfile参数文件
[grid@rac1 ~]$ cat /tmp/asm_pfile.ora
+ASM1.asm_diskgroups='ARCHASMGROUP', 'DATAASMGROUP'#Manual Mount
+ASM2.asm_diskgroups='ARCHASMGROUP', 'DATAASMGROUP'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
2)关闭crs
[root@rac1 ~]# crsctl stop crs
[root@rac1 ~]# crsctl start crs -excl –nocrs
3)使用参数文件启动ASM
[grid@rac1 ~]$ sqlplus / as sysasm
SQL>shutdown immediate
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup pfile='/tmp/asm_pfile.ora';
ASM instance started
SQL> col path for a50
SQL> set linesize 170
SQL> select group_number, disk_number, mount_status, header_status, path from v$asm_disk;
4)创建CRS磁盘组(注意磁盘组和原先的一致,否则在后续恢复的时候会报错)
SQL> create diskgroup CVD normal redundancy disk '/dev/oracleasm/disks/CVD1','/dev/oracleasm/disks/CVD2', '/dev/oracleasm/disks/CVD3' attribute 'au_size'='1M' 'compatible.asm'='11.2.0.0.0', 'compatible.rdbms'='11.2.0.0.0';
create diskgroup CVD normal redundancy disk '/dev/oracleasm/disks/CVD01','/dev/oracleasm/disks/CVD02', '/dev/oracleasm/disks/CVD03'
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15033: disk '/dev/oracleasm/disks/CVD3' belongs to diskgroup "CVD" --这里报错是因为CVD3没清除磁盘头信息
5)清除CVD03磁盘头信息
[root@rac1 dev]# dd if=/dev/zero of=/dev/oracleasm/disks/CVD03 bs=1024 count=1000
SQL> create diskgroup CVD normal redundancy disk '/dev/oracleasm/disks/CVD01','/dev/oracleasm/disks/CVD02', '/dev/oracleasm/disks/CVD03' attribute 'au_size'='1M' 'compatible.asm'='11.2.0.0.0', 'compatible.rdbms'='11.2.0.0.0';
Diskgroup created.
6)创建spfile参数文件
SQL> create spfile='+CVD' from pfile='/tmp/asm_pfile.ora';
File created.
7)查看参数文件已经存在CVD盘中
[grid@rac1 ~]$ asmcmd
ASMCMD> ls
CVD/
ARCHASMGROUP/
DATAASMGROUP/
ASMCMD> ls CVD/
rac
ASMCMD> ls CVD/rac
ASMPARAMETERFILE/
ASMCMD> ls CVD/rac/ASMPARAMETERFILE
REGISTRY.253.941399897
- 利用备份恢复OCR
[root@rac1 ~]# ocrconfig -restore /u01/app/11.2/grid/cdata/rac/backup00.ocr
- 恢复voting disk
[root@rac1 ~]# crsctl replace votedisk +CVD
Successful addition of voting disk 429d9c75-6a804fdf-bfeb8f1d-fa8358e9.
Successful addition of voting disk af38c76f-9d474f1f-bfa5c594-64fb0368.
Successful addition of voting disk 185a41992e3d4f57bf86248889c08561.
Successfully replaced voting disk group with +CVD.
CRS-4266: Voting file(s) successfully replaced
- 检测
[root@rac1 ~]# ocrcheck
[root@rac1 ~]# crsctl query css votedisk
- 停止CRS已正常的方式启动
[root@rac1 ~]# crsctl stop crs
[root@rac1 ~]# crsctl start crs
[root@rac2 ~]# crsctl start crs
[root@rac1 ~]# crsctl stat res -t
四、检查恢复环境
- 集群信息
[root@rac1 ~]# crsctl stat res -t
- 检查ASM spfile信息
SQL> show parameter spfile
- 检查ocr vote信息
[root@rac1 ~]# ocrcheck
- 检查DG是否正常
SQL> archive log list
SQL> select count(*) from v$archived_gap;
SQL> select process,status,thread#,sequence# from v$managed_standby;
SQL> select THREAD#,SEQUENCE#,ARCHIVED,APPLIED from v$archived_log;