11g OCR and Voting Disk Corruption Recovery

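Overview: on 11g RAC, the ora.cssd and ora.crsd processes frequently fail to start. A cssd failure is usually caused by a damaged voting disk or by too few voting disks, while a crsd failure is usually caused by a corrupted OCR or corrupted cluster configuration.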

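1. By default the OCR is backed up every 4 hours. The backup location holds at least 5 OCR backups: the most recent 4-hourly backups, one backup from the last day, and one backup from the last week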

[grid@rac1 rac1]$ ocrconfig -showbackup
rac2     2015/08/25 14:54:37     /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr

rac2     2015/08/24 21:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup01.ocr

rac2     2015/08/24 17:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup02.ocr

rac2     2015/08/24 13:12:33     /u01/app/11.2/grid/cdata/rac-cluster/day.ocr

rac1     2015/08/13 13:12:12     /u01/app/11.2/grid/cdata/rac-cluster/week.ocr
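When choosing a backup to restore from, it helps to pick the newest entry mechanically rather than by eye. The following is a minimal sketch, assuming the listing keeps the "node date time path" layout shown above; newest_ocr_backup is a hypothetical helper name, not part of the Oracle tooling.

```shell
# Hypothetical helper: print the path of the most recent OCR backup.
# Reads `ocrconfig -showbackup` output on stdin and assumes each backup
# line has the form: <node> <YYYY/MM/DD> <HH:MM:SS> <path ending in .ocr>
newest_ocr_backup() {
    awk '$4 ~ /\.ocr$/ { print $2, $3, $4 }' |
        LC_ALL=C sort |      # date and time sort correctly as plain text
        tail -n 1 |
        awk '{ print $3 }'
}

# Typical use (as a user allowed to run ocrconfig):
#   ocrconfig -showbackup | newest_ocr_backup
```

Blank lines and any non-backup messages in the listing are ignored because they do not have a fourth field ending in .ocr.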

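2. Manually back up the OCR (a manual backup taken as root with ocrconfig -manualbackup appears as the last entry in the listing, backup_20150828_090918.ocr)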

[root@rac1 grid]# ocrconfig -showbackup

rac2     2015/08/25 14:54:37     /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr

rac2     2015/08/24 21:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup01.ocr

rac2     2015/08/24 17:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup02.ocr

rac2     2015/08/24 13:12:33     /u01/app/11.2/grid/cdata/rac-cluster/day.ocr

rac1     2015/08/13 13:12:12     /u01/app/11.2/grid/cdata/rac-cluster/week.ocr

rac1     2015/08/28 09:09:18     /u01/app/11.2/grid/cdata/rac-cluster/backup_20150828_090918.ocr

3. Simulate OCR disk corruption

Check which disks the OCR and the voting files use

[root@rac1 grid]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   1da20ec3577a4fa9bf2882a391d66afb (/dev/raw/raw1) [DATA]
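The voting disk listing can also be checked from a script. Below is a small sketch that counts the voting files reported as ONLINE; count_online_votedisks is a hypothetical name, and the parsing assumes the column layout shown above, with the state in the second column.

```shell
# Hypothetical helper: count voting files whose state is ONLINE.
# Reads `crsctl query css votedisk` output on stdin.
count_online_votedisks() {
    awk '$2 == "ONLINE" { n++ } END { print n + 0 }'
}

# Typical use:
#   crsctl query css votedisk | count_online_votedisks
```

A result of 0 matches the situation described in this article, where cssd cannot discover any voting file.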

Simulate the corruption by overwriting the start of the OCR disk

dd if=/dev/zero of=/dev/raw/raw1 bs=4K count=100

4. Start the cluster and watch the cluster alert log

[grid@rac1 ~]$ cd /u01/app/11.2/grid/log/rac1/
[grid@rac1 rac1]$ tail -f alertrac1.log
[root@rac1 grid]# crsctl start cluster -all

The log shows the following errors:

2015-08-28 09:22:17.471: 
[ohasd(1990)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2015-08-28 09:22:17.471: 
[ohasd(1990)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2015-08-28 09:22:30.845: 
[cssd(7243)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2/grid/log/rac1/cssd/ocssd.log

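In other words, the cluster cannot find any voting file. /dev/raw/raw1 backs the DATA disk group, which holds both the voting file and the OCR (the cluster configuration), so this is the result we expected from overwriting that disk with dd.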
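To spot this symptom without reading the whole log, the alert log can be scanned for the CRS-1714 message. This is only a sketch: voting_discovery_failures is a hypothetical helper, and it merely counts matching lines.

```shell
# Hypothetical helper: count CRS-1714 ("Unable to discover any voting
# files") messages in alert log text read from stdin.
voting_discovery_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is neutralised with `|| true`.
    grep -c 'CRS-1714' || true
}

# Typical use, with the log path from this article:
#   voting_discovery_failures < /u01/app/11.2/grid/log/rac1/alertrac1.log
```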

The following steps restore the OCR and bring the cluster back up:

5. Stop the cluster on all nodes

[root@rac1 grid]# crsctl stop crs -f

If the stack cannot be stopped this way, list the remaining processes:

ps -elf | egrep "PID|d.bin|ohas|oraagent|orarootagent|cssdagent|cssdmonitor" | grep -v grep

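With the command above, each listed PID has to be killed by hand with kill -9. Alternatively, the following one-liner finds and kills them in one pass: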

ps -elf | egrep "d.bin|ohas|oraagent|orarootagent|cssdagent|cssdmonitor" | grep -v grep |awk '{print $4}' |xargs -n 10  kill -9

Confirm that the cluster stack has stopped with the following checks

[root@rac1 grid]# ps -ef|grep crs
root      9229  4909  0 09:42 pts/0    00:00:00 grep crs
[root@rac1 grid]# ps -ef|grep css
root      9231  4909  0 09:42 pts/0    00:00:00 grep css
[root@rac1 grid]# ps -ef|grep evm
root      9236  4909  0 09:42 pts/0    00:00:00 grep evm
[root@rac1 grid]# ps -ef|grep ohas
root      9204     1  0 09:41 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root      9239  4909  0 09:42 pts/0    00:00:00 grep ohas
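The four checks above can be folded into one filter. The sketch below reuses the process-name patterns from the kill command earlier in this step, but deliberately leaves out the "ohas" pattern, on the assumption (consistent with the listing above) that the /etc/init.d/init.ohasd wrapper is expected to remain. remaining_clusterware_procs is a hypothetical name.

```shell
# Hypothetical helper: print any Clusterware daemon lines that are still
# present in `ps -ef` output read from stdin. Empty output means stopped.
remaining_clusterware_procs() {
    grep -E 'd\.bin|oraagent|orarootagent|cssdagent|cssdmonitor' |
        grep -v grep || true
}

# Typical use:
#   ps -ef | remaining_clusterware_procs
```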

6. Start CRS in exclusive mode

[root@rac1 grid]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac1'
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2676: Start of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded

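Notes:

-excl starts the stack in exclusive mode on this node, which does not require the voting files

-nocrs starts the stack without the crsd daemon, so startup does not depend on a readable OCR

Cluster state at this point: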

[grid@rac1 trace]$ crsctl stat res -t -init
--------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
----------------------------------------------------------------
Cluster Resources
------------------------------------------------------------------------
ora.asm
  1        ONLINE  INTERMEDIATE rac1                     OCR not started     
ora.cluster_interconnect.haip
  1        ONLINE  ONLINE       rac1                                         
ora.crf
  1        OFFLINE OFFLINE                                                   
ora.crsd
  1        OFFLINE OFFLINE                                                   
ora.cssd
  1        ONLINE  ONLINE       rac1                                         
ora.cssdmonitor
  1        ONLINE  ONLINE       rac1                                         
ora.ctssd
  1        ONLINE  ONLINE       rac1                     ACTIVE:0            
ora.diskmon
  1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
  1        ONLINE  ONLINE       rac1                                         
ora.evmd
  1        OFFLINE OFFLINE                                                   
ora.gipcd
  1        ONLINE  ONLINE       rac1                                         
ora.gpnpd
  1        ONLINE  ONLINE       rac1                                         
ora.mdnsd
  1        ONLINE  ONLINE       rac1
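Before continuing, the states that matter here (ora.cssd ONLINE, ora.crsd OFFLINE) can be read out of this listing with a script. A minimal sketch follows, assuming the two-line layout shown above where the resource name is on its own line and the next line carries TARGET and STATE; resource_state is a hypothetical helper.

```shell
# Hypothetical helper: print the STATE column for one resource.
# Reads `crsctl stat res -t -init` output on stdin; $1 is the resource name.
resource_state() {
    awk -v res="$1" '
        found { print $3; exit }              # line after the name: id, TARGET, STATE
        NF == 1 && $1 == res { found = 1 }    # resource name on a line of its own
    '
}

# Typical use:
#   crsctl stat res -t -init | resource_state ora.cssd
```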

7. Recreate the disk group for the OCR and voting files

SQL> create diskgroup data external redundancy disk '/dev/raw/raw1' attribute 'au_size'='1M','compatible.asm' = '11.2.0','compatible.rdbms' = '11.2.0';

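Note: the disk group name must be identical to the one that held the OCR and voting files before the corruption (DATA in this environment); otherwise restoring the OCR fails with: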

 PROT-35: The configured OCR locations are not accessible.

8. Restore the OCR from a backup

[root@rac1 bin]# ./ocrconfig -restore /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr

The result can be checked with the following commands:

cluvfy comp ocr -n all
ocrcheck
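For a scripted check, the ocrcheck output can be tested for its success line. This is a sketch only: ocr_integrity_ok is a hypothetical helper, and it looks for the "Cluster registry integrity check succeeded" text that ocrcheck prints in this article's later output.

```shell
# Hypothetical helper: succeed (exit 0) only if ocrcheck output on stdin
# reports that the cluster registry integrity check succeeded.
ocr_integrity_ok() {
    grep -q 'Cluster registry integrity check succeeded'
}

# Typical use:
#   if ocrcheck | ocr_integrity_ok; then echo "OCR looks healthy"; fi
```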

9. Restore the voting disk

[root@rac1 bin]# ./crsctl replace votedisk +DATA
Successful addition of voting disk 4201f39953204fbdbf2b502ef4abe9cb.
Successfully replaced voting disk group with +DATA.
CRS-4266: Voting file(s) successfully replaced

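Note that this step may fail with an error like the following (this example comes from a disk group named ocrvote):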

crsctl replace votedisk +ocrvote
CRS-4602: Failed 27 to add voting file 5a71f4b0868e4f8abfc4808566c5c7fa.
CRS-4602: Failed 27 to add voting file 66699f04c8a74f57bf08e0682294e449.
CRS-4602: Failed 27 to add voting file 7181a4d009884fecbff2cab4c69f2de2.
Failed to replace voting disk group with +ocrvote.
CRS-4000: Command Replace failed, or completed with errors.

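This can be fixed by checking asm_diskstring, which is empty here, and setting it so that ASM can discover the disks: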

SQL> show parameter disk

NAME                                 TYPE
------------------------------------ ----------------------
VALUE
------------------------------
asm_diskgroups                       string
OCRVOTE
asm_diskstring                       string

SQL> alter system set asm_diskstring='/dev/raw/*';

Then rerun the command to restore the voting disk

Verify:

[grid@rac1 ~]$ crsctl query css votedisk

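10. Recreate the spfile

Note: if the cluster's ASM spfile was stored on the shared disk that held the OCR, it has to be recreated at this point. There are two ways:

1) Use the 11g feature that builds an spfile from the running instance's memory

SQL> create spfile from memory;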

2) Create it manually

[root@rac2 ~]# vi /tmp/asm_pfile.txt

Add the following parameters:

 *.asm_power_limit=1
 *.diagnostic_dest='/u01/app/grid/11.2.0/log'
 *.instance_type='asm'
 *.large_pool_size=12M
 *.remote_login_passwordfile='EXCLUSIVE'

Recreate the spfile from the file we just edited

SQL> create spfile='+DATA' from pfile='/tmp/asm_pfile.txt';
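Instead of editing the file interactively in vi, the same pfile can be written non-interactively. The sketch below uses a here-document with exactly the five parameters listed above; write_asm_pfile is a hypothetical helper name, and the parameter values are the ones from this environment, so adjust them for yours.

```shell
# Hypothetical helper: write the minimal ASM pfile used in this article.
# $1 is the destination path, for example /tmp/asm_pfile.txt
write_asm_pfile() {
    cat > "$1" <<'EOF'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid/11.2.0/log'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
EOF
}

# Typical use:
#   write_asm_pfile /tmp/asm_pfile.txt
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the parameter values.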

11. Stop the cluster, then start it again:

[root@rac1 grid]# crsctl stop crs -f

[root@rac1 grid]# crsctl start crs

[root@rac1 grid]# crsctl start cluster -all

12. Check the state of the cluster resources:

1) Cluster information

[root@rac1 grid]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
           ONLINE  ONLINE       rac1                                         
           ONLINE  ONLINE       rac2                                         
ora.LISTENER.lsnr
           ONLINE  ONLINE       rac1                                         
           ONLINE  ONLINE       rac2                                         
ora.ORADATA.dg
           ONLINE  ONLINE       rac1                                         
           ONLINE  ONLINE       rac2                                         
ora.asm
           ONLINE  ONLINE       rac1                     Started             
           ONLINE  ONLINE       rac2                     Started             
ora.gsd
           OFFLINE OFFLINE      rac1                                         
           OFFLINE OFFLINE      rac2                                         
ora.net1.network
           ONLINE  ONLINE       rac1                                         
           ONLINE  ONLINE       rac2                                         
ora.ons
           ONLINE  ONLINE       rac1                                         
           ONLINE  ONLINE       rac2                                         
ora.registry.acfs
           ONLINE  ONLINE       rac1                                         
           ONLINE  ONLINE       rac2                                         
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
  1        ONLINE  ONLINE       rac1                                         
ora.cvu
  1        ONLINE  ONLINE       rac1                                         
ora.oc4j
  1        ONLINE  ONLINE       rac1                                         
ora.rac1.vip
  1        ONLINE  ONLINE       rac1                                         
ora.rac2.vip
  1        ONLINE  ONLINE       rac2                                         
ora.scan1.vip
  1        ONLINE  ONLINE       rac1                                         
ora.sunny.db
  1        ONLINE  ONLINE       rac1                     Open                
  2        ONLINE  ONLINE       rac2                     Open

2) Check the OCR and voting information

[root@rac1 grid]# ocrcheck
Status of Oracle Cluster Registry is as follows :
     Version                  :          3
     Total space (kbytes)     :     262120
     Used space (kbytes)      :       3084
     Available space (kbytes) :     259036
     ID                       :  101930821
     Device/File Name         :      +DATA
                                Device/File integrity check succeeded

                                Device/File not configured

                                Device/File not configured

                                Device/File not configured

                                Device/File not configured

     Cluster registry integrity check succeeded

     Logical corruption check succeeded

3) Check the spfile

SQL> show parameter spfile

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      /u01/app/11.2/grid/dbs/spfile+
                                             ASM1.ora

4) Check that the disk groups are healthy

[grid@rac1 rac-cluster]$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576      3082     2687                0            2687              0             Y  DATA/
MOUNTED  EXTERN  N         512   4096  1048576      8197     5802                0            5802              0             N  ORADATA/
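To confirm from a script that the voting files landed in the intended disk group, the Voting_files column can be read directly. A minimal sketch, assuming the column order shown above (Voting_files second to last, Name last); voting_diskgroups is a hypothetical helper.

```shell
# Hypothetical helper: print the names of mounted disk groups that hold
# voting files. Reads `asmcmd lsdg` output on stdin.
voting_diskgroups() {
    awk '$1 == "MOUNTED" && $(NF-1) == "Y" {
        sub(/\/$/, "", $NF)   # drop the trailing slash from the name
        print $NF
    }'
}

# Typical use:
#   asmcmd lsdg | voting_diskgroups
```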

Recovery is now complete. Finally, move the spfile into the DATA disk group:

SQL> create pfile='/tmp/aa.txt' from spfile;

File created.

SQL> create spfile='+DATA' from pfile='/tmp/aa.txt';

File created.

Stop the cluster and start it again:

crsctl stop cluster -all

crsctl start cluster -all

13. Further notes

1) Backing up the OCR manually, and export/import:

[root@rac1 rac-cluster]# ocrconfig -manualbackup

The -export option dumps the OCR contents to a file, and -import restores the OCR from such a file:

ocrconfig -export /tmp/ocr.bak
ocrconfig -import file_name

If the OCR is being recovered from an export file, the -restore option cannot be used; use -import instead
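The pairing described above (backups go with -restore, exports go with -import) is easy to get wrong under pressure, so it can be encoded once. The following dry-run sketch only prints the command it would use; ocr_restore_command and its "backup"/"export" labels are hypothetical, chosen for illustration.

```shell
# Hypothetical helper: print the ocrconfig command matching the file type.
# $1: "backup" for an automatic or manual backup, "export" for an export file
# $2: path to the file
ocr_restore_command() {
    case "$1" in
        backup) printf 'ocrconfig -restore %s\n' "$2" ;;
        export) printf 'ocrconfig -import %s\n' "$2" ;;
        *)      echo "unknown file type: $1" >&2; return 1 ;;
    esac
}

# Typical use (review the printed command, then run it as root):
#   ocr_restore_command export /tmp/ocr.bak
```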

2) Using kfed to read the disk header and find where the voting file is located on the disk

[root@rac1 rac-cluster]# kfed read  /dev/raw/raw1 | grep -E 'vfstart|vfend'

kfdhdb.vfstart:                     320 ; 0x0ec: 0x00000140
kfdhdb.vfend:                       352 ; 0x0f0: 0x00000160
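The two header fields can be turned into a quick yes/no answer. The sketch below assumes that vfstart and vfend are allocation-unit numbers and that both being 0 means the disk carries no voting file; voting_file_extent is a hypothetical helper, so treat its output as a hint rather than a definitive check.

```shell
# Hypothetical helper: summarise the voting file fields of an ASM disk header.
# Reads `kfed read <device>` output on stdin.
voting_file_extent() {
    awk '
        $1 == "kfdhdb.vfstart:" { s = $2 }
        $1 == "kfdhdb.vfend:"   { e = $2 }
        END {
            if (s + 0 == 0 && e + 0 == 0) print "no voting file"
            else print "AUs " s "-" e
        }
    '
}

# Typical use:
#   kfed read /dev/raw/raw1 | voting_file_extent
```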

Important: without a backup the OCR cannot be restored and can only be rebuilt, so make checking the OCR backups part of routine work