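Case background
A 4-node Extended RAC cluster hit a RAID 5 parity error on its storage. After the storage was repaired, the SOLDATA disk group could no longer be mounted and failed with ORA-15096: lost disk write detected.
ASM alert log from the attempt to mount the SOLDATA disk group: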
Fri Sep 25 00:31:57 2020
NOTE: GMON heartbeating for grp 2 (SOLDATA)
GMON querying group 2 at 5 for pid 27, osid 187323
Fri Sep 25 00:31:57 2020
NOTE: cache is mounting group SOLDATA created on 2019/04/12 15:10:32
NOTE: cache opening disk 0 of grp 2: SOLDATA_0000 path:/dev/emcpowerb
NOTE: group 2 (SOLDATA) high disk header ckpt advanced to fcn 0.714
NOTE: 09/25/20 00:31:57 SOLDATA.F1X0 found on disk 0 au 10 fcn 0.714 datfmt 2
NOTE: cache opening disk 1 of grp 2: SOLDATA_0001 path:/dev/emcpowerc
NOTE: cache opening disk 2 of grp 2: SOLDATA_0002 path:/dev/emcpowerd
NOTE: cache opening disk 3 of grp 2: SOLDATA_0003 path:/dev/emcpowerh
Fri Sep 25 00:31:57 2020
NOTE: cache mounting (first) external redundancy group 2/0xB82BB917 (SOLDATA)
Fri Sep 25 00:31:57 2020
* allocate domain 2, invalid = TRUE
kjbdomatt send to inst 2
Fri Sep 25 00:31:57 2020
NOTE: attached to recovery domain 2
Fri Sep 25 00:31:57 2020
NOTE: crash recovery of group SOLDATA will recover thread=1 ckpt=28.3507 domain=2 inc#=2 instnum=2
NOTE: crash recovery of group SOLDATA will recover thread=2 ckpt=39.8576 domain=2 inc#=4 instnum=1
NOTE: crash recovery of group SOLDATA will recover thread=3 ckpt=21.9043 domain=2 inc#=6 instnum=4
NOTE: crash recovery of group SOLDATA will recover thread=4 ckpt=22.6878 domain=2 inc#=12 instnum=3
* validated domain 2, flags = 0x0
NOTE: BWR validation signaled ORA-15096
Fri Sep 25 00:31:57 2020
Errors in file /u01/product/grid/crs/diag/asm/+asm/+ASM1/trace/+ASM1_ora_187323.trc:
ORA-15096: lost disk write detected
NOTE: crash recovery signalled OER-15096
ERROR: ORA-15096 signalled during mount of diskgroup SOLDATA
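The four "crash recovery" lines in the log above carry the thread-to-instance mapping and the checkpoint each thread will be recovered from. As a minimal sketch, they can be pulled out of an alert log with a regular expression. The function name and the dictionary layout are hypothetical, and reading `ckpt=28.3507` as `seq.blk` is an assumption based on the KFED output for thread 1 (`kfracdc.ckpt.seq` 28, `kfracdc.ckpt.blk` 3507).

```python
import re

# Lines copied verbatim from the alert log above.
LOG = """\
NOTE: crash recovery of group SOLDATA will recover thread=1 ckpt=28.3507 domain=2 inc#=2 instnum=2
NOTE: crash recovery of group SOLDATA will recover thread=2 ckpt=39.8576 domain=2 inc#=4 instnum=1
NOTE: crash recovery of group SOLDATA will recover thread=3 ckpt=21.9043 domain=2 inc#=6 instnum=4
NOTE: crash recovery of group SOLDATA will recover thread=4 ckpt=22.6878 domain=2 inc#=12 instnum=3
"""

# Assumption: ckpt=<seq>.<blk>, matching kfracdc.ckpt.seq / kfracdc.ckpt.blk.
PATTERN = re.compile(
    r"crash recovery of group (?P<group>\S+) will recover "
    r"thread=(?P<thread>\d+) ckpt=(?P<seq>\d+)\.(?P<blk>\d+) "
    r"domain=(?P<domain>\d+) inc#=(?P<inc>\d+) instnum=(?P<instnum>\d+)"
)


def parse_recovery_threads(text):
    """Return {thread: {"seq": ..., "blk": ..., "instnum": ...}} (hypothetical helper)."""
    threads = {}
    for m in PATTERN.finditer(text):
        threads[int(m["thread"])] = {
            "seq": int(m["seq"]),
            "blk": int(m["blk"]),
            "instnum": int(m["instnum"]),
        }
    return threads


if __name__ == "__main__":
    for thread, info in sorted(parse_recovery_threads(LOG).items()):
        print(thread, info)
```

Note that the thread number and the instance number do not coincide in this cluster (thread 1 belongs to instance 2, thread 3 to instance 4), so the mapping has to be taken from the log rather than assumed.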
The description of ORA-15096 shows that the action Oracle officially recommends is rather pessimistic.
[grid@lx1 ~]$ oerr ora 15096
15096, 0000, "lost disk write detected"
// *Cause: A failure either by disk hardware or disk software caused a disk
// write to to be lost, even though ASM received acknowledgement that
// the write completed. Alternatively, a clustering hardware failure
// or a clustering software failure resulted in an ASM instance
// believing that another ASM instance had crashed, when in fact it
// was still active.
// *Action: The disk group is corrupt and cannot be recovered. The disk group
// must be recreated, and its contents restored from backups.
Reading the ACD checkpoint of each of the 4 threads with KFED gives the following:
thread 1 (inst_id 2) ACDC:
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 3 ; 0x008: file=3
kfbh.check: 2236936757 ; 0x00c: 0x8554f235
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
kfracdc.eyec[0]: 65 ; 0x000: 0x41
kfracdc.eyec[1]: 67 ; 0x001: 0x43
kfracdc.eyec[2]: 68 ; 0x002: 0x44
kfracdc.eyec[3]: 67 ; 0x003: 0x43
kfracdc.thread: 1 ; 0x004: 0x00000001
kfracdc.lastAba.seq: 4294967295 ; 0x008: 0xffffffff
kfracdc.lastAba.blk: 4294967295 ; 0x00c: 0xffffffff
kfracdc.blk0: 1 ; 0x010: 0x00000001
kfracdc.blks: 10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq: 28 ; 0x018: 0x0000001c
kfracdc.ckpt.blk: 3507 ; 0x01c: 0x00000db3
kfracdc.fcn.base: 5091911 ; 0x020: 0x004db247
kfracdc.fcn.wrap: 0 ; 0x024: 0x00000000
kfracdc.bufBlks: 256 ; 0x028: 0x00000100
kfracdc.strt112.seq: 2 ; 0x02c: 0x00000002
kfracdc.strt112.blk: 0 ; 0x030: 0x00000000
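The thread 1 dump above can be cross-checked against the alert log. The following is a minimal sketch that parses the `name: value ; offset: hex` lines of a KFED dump, decodes the eye-catcher bytes, and extracts the checkpoint. The helper name is hypothetical, and the last calculation (where the next thread's ACDC block sits) is only an observation from this particular dump, not a documented layout rule.

```python
import re

# A subset of the thread 1 KFED output, copied from the dump above.
KFED_DUMP = """\
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.block.blk: 0 ; 0x004: blk=0
kfracdc.eyec[0]: 65 ; 0x000: 0x41
kfracdc.eyec[1]: 67 ; 0x001: 0x43
kfracdc.eyec[2]: 68 ; 0x002: 0x44
kfracdc.eyec[3]: 67 ; 0x003: 0x43
kfracdc.thread: 1 ; 0x004: 0x00000001
kfracdc.blk0: 1 ; 0x010: 0x00000001
kfracdc.blks: 10751 ; 0x014: 0x000029ff
kfracdc.ckpt.seq: 28 ; 0x018: 0x0000001c
kfracdc.ckpt.blk: 3507 ; 0x01c: 0x00000db3
"""

# Field name, then the decimal value that precedes the first ';'.
FIELD = re.compile(r"^(?P<name>[\w.\[\]]+):\s+(?P<value>\d+)\s*;", re.MULTILINE)


def parse_kfed(text):
    """Return {field_name: decimal_value} for a KFED dump (hypothetical helper)."""
    return {m["name"]: int(m["value"]) for m in FIELD.finditer(text)}


fields = parse_kfed(KFED_DUMP)

# The four eye-catcher bytes are ASCII characters.
eyecatcher = "".join(chr(fields[f"kfracdc.eyec[{i}]"]) for i in range(4))

# Checkpoint as (seq, blk); the alert log prints it as "seq.blk".
ckpt = (fields["kfracdc.ckpt.seq"], fields["kfracdc.ckpt.blk"])

# Observation only: this ACDC block + blk0 + blks gives the block number
# reported in the header of the thread 2 ACDC dump.
next_acdc_blk = fields["kfbh.block.blk"] + fields["kfracdc.blk0"] + fields["kfracdc.blks"]

if __name__ == "__main__":
    print(eyecatcher, ckpt, next_acdc_blk)
```

For thread 1 the checkpoint comes out as (28, 3507), which agrees with `ckpt=28.3507` in the crash recovery line of the alert log, and the computed block number 10752 agrees with `kfbh.block.blk` in the thread 2 header.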
thread 2 (inst_id 1) ACDC:
kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 7 ; 0x002: KFBTYP_ACDC
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 10752 ; 0x004: blk=10752
kfbh.block.obj: 3 ; 0x008: file=3
kfbh.check: 3866731362 ; 0x00c: 0xe