Handling ORA-15042 on an ASM Disk Group

Latest recommended article published on 2021-11-23 17:30:00

P10ZHUO


Article link: https://blog.csdn.net/fanzhuozhuo/article/details/109601228


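In a NORMAL-redundancy disk group, suppose one failgroup goes offline unexpectedly (for example, one storage node of an engineered appliance reboots). If a disk in another failgroup is damaged before the offline failgroup has been successfully brought back online, the disk group can no longer be mounted, even after the offline failgroup is restored. The reason is that the PST, the key ASM metadata structure introduced in an earlier post, still records those disks as not normally accessible.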

Building a NORMAL-redundancy disk group

SQL> select GRPNUM_KFDSK,NUMBER_KFDSK,MODE_KFDSK,FAILNAME_KFDSK,PATH_KFDSK from x$kfdsk where GRPNUM_KFDSK=2;
GRPNUM_KFDSK NUMBER_KFDSK MODE_KFDSK FAILNAME_KFDSK                 PATH_KFDSK
------------ ------------ ---------- ------------------------------ --------------------
           2            0        127 FG1                            /dev/asm_arch01
           2            1        127 FG1                            /dev/asm_arch02
           2            2        127 FG2                            /dev/asm_arch03
           2            3        127 FG2                            /dev/asm_arch04
           2            4        127 FG3                            /dev/asm_arch05
           2            5        127 FG3                            /dev/asm_arch06

6 rows selected.

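To make the effect of the recovery observable, we track the changes to one record: after offlining the disk that holds its primary extent, we update the row, then corrupt the disk that holds its secondary extent, and finally check whether the transaction was lost.
A test table named test is created manually here, and the physical location of one of its rows is looked up.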

SQL> create tablespace test datafile '+archdg' size 1G;

Tablespace created.

SQL> alter user zhuo quota unlimited on test;

User altered.

SQL> create table zhuo.test tablespace test as select * from dba_objects;

Table created.

select  object_id,object_name,
dbms_rowid.rowid_relative_fno(rowid) rel_fno#,
dbms_rowid.rowid_block_number(rowid) block#
from zhuo.test where rownum=1;
 OBJECT_ID OBJECT_NAM   REL_FNO#     BLOCK#
---------- ---------- ---------- ----------
        20 ICOL$               6        131

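A script maps the data block to the ASM disks. Because the disk group uses normal redundancy, two copies appear: the row with LXN_KFFXP = 0 is the primary extent, on disk 0, and the row with LXN_KFFXP = 1 is the secondary extent, on disk 4. In the following steps we offline the failgroup that contains disk 0 and then corrupt disk 4.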

SQL> @/tmp/asm_block
Enter value for block: 131
Enter value for file_number: 256
Enter value for file_type: datafile
Enter value for filename: TEST.256.1056118191

                DATAFILE                             PHYSICS                              DISK_NUMBER                EXTENT                  EXTENT       EXTENT
GROUP_NAME      NUMBER_NAME                    EXTENT_NUMBER EXTENT_NUMBER LOGICAL_NUMBER DISK_NMAE                  NUMBER DISK_BLOCK  BEGIN_BLOCK    END_BLOCK   TOTAL_AU BEGIN_BLOC END_BLOCK
--------------- ------------------------------ ------------- ------------- -------------- ------------------------- ------- ---------- ------------ ------------ ---------- ---------- ---------
ARCHDG          256.TEST.256.1056118191                    3             1              1 4.ARCHDG_0004                  33       4099         4224         4351          2 256              255
ARCHDG          256.TEST.256.1056118191                    2             1              0 0.ARCHDG_0000                  36       4611         4608         4735          2 128              255
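
The mapping in this output can be reproduced with a little arithmetic. The following is a minimal sketch, not the author's /tmp/asm_block script. It assumes an 8 KiB database block size (not stated in the article), one AU per extent, and that under two-way mirroring the physical extent number is 2 * logical extent + copy index, which is consistent with the values 2 and 3 shown above. The function name locate_block is hypothetical.

```python
# Sketch: map a datafile block number to its ASM extent under 2-way mirroring.
# Assumptions (not from the article): 8 KiB database blocks, one AU per extent,
# physical extent = 2 * logical extent + copy index (0 = primary, 1 = secondary).

def locate_block(block_no, db_block_size=8192, au_size=1024 * 1024):
    blocks_per_au = au_size // db_block_size          # 128 blocks per 1 MiB AU
    logical_extent = block_no // blocks_per_au        # which extent holds the block
    first_block = logical_extent * blocks_per_au      # first block in that extent
    last_block = first_block + blocks_per_au - 1      # last block in that extent
    physical_extents = [2 * logical_extent + copy for copy in (0, 1)]
    return {
        "logical_extent": logical_extent,
        "block_range": (first_block, last_block),
        "physical_extents": physical_extents,
    }

if __name__ == "__main__":
    # Block 131 of the TEST datafile, au_size = 1M as in the CREATE DISKGROUP statement.
    print(locate_block(131))
```

For block 131 this gives logical extent 1, covering blocks 128 to 255, with physical extents 2 (primary) and 3 (secondary), matching the two rows reported for disks 0 and 4.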

Locating the PST

1) ASM alert log:

Tue Nov 10 14:03:24 2020
SQL> CREATE DISKGROUP archdg NORMAL REDUNDANCY  FAILGROUP FG1 DISK '/dev/asm_arch01' SIZE 3072M ,
'/dev/asm_arch02' SIZE 2048M  FAILGROUP FG2 DISK '/dev/asm_arch03' SIZE 2048M ,
'/dev/asm_arch04' SIZE 1024M  FAILGROUP FG3 DISK '/dev/asm_arch05' SIZE 1024M ,
'/dev/asm_arch06' SIZE 1024M  ATTRIBUTE 'compatible.asm'='11.2.0.0.0','compatible.rdbms'='11.2','au_size'='1M' /* ASMCA */ 
NOTE: Assigning number (2,0) to disk (/dev/asm_arch01)     --- a disk number is assigned to each disk
NOTE: Assigning number (2,1) to disk (/dev/asm_arch02)
NOTE: Assigning number (2,2) to disk (/dev/asm_arch03)
NOTE: Assigning number (2,3) to disk (/dev/asm_arch04)
NOTE: Assigning number (2,4) to disk (/dev/asm_arch05)
NOTE: Assigning number (2,5) to disk (/dev/asm_arch06)
NOTE: initializing header on grp 2 disk ARCHDG_0000
NOTE: initializing header on grp 2 disk ARCHDG_0001
NOTE: initializing header on grp 2 disk ARCHDG_0002
NOTE: initializing header on grp 2 disk ARCHDG_0003
NOTE: initializing header on grp 2 disk ARCHDG_0004
NOTE: initializing header on grp 2 disk ARCHDG_0005
Tue Nov 10 14:03:24 2020
GMON updating for reconfiguration, group 2 at 5 for pid 19, osid 3478
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
GMON updating group 2 at 6 for pid 19, osid 3478
NOTE: group ARCHDG: initial PST location: disk 0000 (PST copy 0)    --- location 1
NOTE: group ARCHDG: initial PST location: disk 0002 (PST copy 1)    --- location 2
NOTE: group ARCHDG: initial PST location: disk 0004 (PST copy 2)    --- location 3
NOTE: PST update grp = 2 completed successfully 
......

2) Reading with kfed:

[grid@11gasm ~]$ kfed read /dev/asm_arch01 aun=1 |grep type
kfbh.type:                           17 ; 0x002: KFBTYP_PST_META
[grid@11gasm ~]$ kfed read /dev/asm_arch02 aun=1 |grep type 
kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE
[grid@11gasm ~]$ kfed read /dev/asm_arch03 aun=1 |grep type 
kfbh.type:                           17 ; 0x002: KFBTYP_PST_META
[grid@11gasm ~]$ kfed read /dev/asm_arch04 aun=1 |grep type 
kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE
[grid@11gasm ~]$ kfed read /dev/asm_arch05 aun=1 |grep type 
kfbh.type:                           17 ; 0x002: KFBTYP_PST_META
[grid@11gasm ~]$ kfed read /dev/asm_arch06 aun=1 |grep type 
kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE

type=KFBTYP_PST_META means that this block is a PST metadata header; disks reporting KFBTYP_PST_NONE hold no PST copy.
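
When there are many disks, the kfed output can be scanned programmatically. The sketch below only parses text that has already been captured from kfed read ... aun=1; it does not call kfed itself. The function name pst_disks and the dictionary layout are illustrative assumptions, not part of any Oracle tool.

```python
import re

# Matches a line such as:
#   kfbh.type:                           17 ; 0x002: KFBTYP_PST_META
KFED_TYPE_RE = re.compile(r"kfbh\.type:\s+(\d+)\s*;\s*0x[0-9a-fA-F]+:\s*(\w+)")

def pst_disks(kfed_output_by_device):
    """Return the devices whose AU 1 header block is a PST metadata block."""
    found = []
    for device, text in kfed_output_by_device.items():
        match = KFED_TYPE_RE.search(text)
        if match and match.group(2) == "KFBTYP_PST_META":
            found.append(device)
    return found

# Output lines copied from the kfed commands above.
sample = {
    "/dev/asm_arch01": "kfbh.type:                           17 ; 0x002: KFBTYP_PST_META",
    "/dev/asm_arch02": "kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE",
    "/dev/asm_arch03": "kfbh.type:                           17 ; 0x002: KFBTYP_PST_META",
    "/dev/asm_arch04": "kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE",
    "/dev/asm_arch05": "kfbh.type:                           17 ; 0x002: KFBTYP_PST_META",
    "/dev/asm_arch06": "kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE",
}

if __name__ == "__main__":
    print(pst_disks(sample))
```

The result lists /dev/asm_arch01, /dev/asm_arch03 and /dev/asm_arch05, that is disks 0, 2 and 4, one per failgroup, which agrees with the initial PST locations in the alert log.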

3) gmon trace:

*** 2020-11-10 14:03:24.620
GMON updating for reconfiguration, group 2 at 5 for pid 19, osid 3478
GMON updating group 2 at 6 for pid 19, osid 3478
NOTE: GMON selects PST disk ARCHDG_0000 in failgroup FG1
NOTE: GMON selects PST disk ARCHDG_0002 in failgroup FG2
NOTE: GMON selects PST disk ARCHDG_0004 in failgroup FG3
PRE
=============== PST ==================== 
grpNum:    2                                 --- disk group number
state:     1 
callCnt:   6 
(lockvalue) valid=0 ver=0.0 ndisks=0 flags=0x2 from inst=0 (I am 1) last=0
--------------- HDR -------------------- 
next:    1 
last:    1 
pst count:       3                            --- number of PST copies
pst locations:   0  2  4                   --- disk numbers that hold the PST copies
incarn:          1 
dta size:        6 
version:         1 
ASM version:     186646528 = 11.2.0.0.0
contenttype:     1
partnering pattern:      [ ]
--------------- LOC MAP ---------------- 
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 0      stable_loc: 0
--------------- DTA -------------------- 
0: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 
1: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 
2: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 
3: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 
4: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 
5: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts:
......

external

Simulating the failure

offline fg1

fg1 is the failgroup that holds the primary extent. It is taken offline manually here to simulate a server shutdown in a production environment.
SQL> alter diskgroup archdg offline disks in failgroup fg1;

Diskgroup altered.

The following output shows that the PST has changed.
ASM alert log:

Tue Nov 10 14:14:35 2020
SQL> alter diskgroup archdg offline disks in failgroup fg1 
NOTE: DRTimer CD Create:  for disk group 2 disks:
 0
 1
NOTE: process _user5251_+asm (5251) initiating offline of disk 0.3915950070 (ARCHDG_0000) with mask 0x7e in group 2
NOTE: process _user5251_+asm (5251) initiating offline of disk 1.3915950068 (ARCHDG_0001) with mask 0x7e in group 2
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968a7f6, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968a7f4, mask = 0x6a, op = clear
Tue Nov 10 14:14:35 2020
GMON updating disk modes for group 2 at 13 for pid 18, osid 5251
WARNING: GMON has insufficient disks to maintain consensus. Minimum required is 2: updating 2 PST copies from a total of 3.                               --- the number of PST copies changes
NOTE: group ARCHDG: updated PST location: disk 0002 (PST copy 0)         
NOTE: group ARCHDG: updated PST location: disk 0004 (PST copy 1)
NOTE: PST update grp = 2 completed successfully 
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968a7f6, mask = 0x7e, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968a7f4, mask = 0x7e, op = clear
GMON updating disk modes for group 2 at 14 for pid 18, osid 5251
NOTE: group ARCHDG: updated PST location: disk 0002 (PST copy 0)     --- the PST copies now reside only on disks 2 and 4
NOTE: group ARCHDG: updated PST location: disk 0004 (PST copy 1)
NOTE: cache closing disk 0 of grp 2: ARCHDG_0000
NOTE: cache closing disk 1 of grp 2: ARCHDG_0001
NOTE: PST update grp = 2 completed successfully 
NOTE: DRTimer CD Destroy: for diskgroup 2
SUCCESS: alter diskgroup archdg offline disks in failgroup fg1

The gmon trace at this point:

*** 2020-11-10 14:14:35.384
GMON updating disk modes for group 2 at 13 for pid 18, osid 5251
  dsk = 0/0xe968a7f6, mask = 0x6a, op = clear
  dsk = 1/0xe968a7f4, mask = 0x6a, op = clear
PRE
=============== PST ==================== 
grpNum:    2 
state:     1 
callCnt:   13 
(lockvalue) valid=1 ver=0.0 ndisks=3 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR -------------------- 
next:    2 
last:    2 
pst count:       3 
pst locations:   0  2  4 
incarn:          1 
dta size:        6 
version:         1 
ASM version:     186646528 = 11.2.0.0.0
contenttype:     1
partnering pattern:      [ ]
--------------- LOC MAP ---------------- 
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 1      stable_loc: 1
--------------- DTA -------------------- 
0: sts v v(rw) p(rw) a(x) d(x) fg# = 1 addTs = 2463424728 parts: 2 (amp) 4 (amp) 3 (amp) 5 (amp) 
1: sts v v(rw) p(rw) a(x) d(x) fg# = 1 addTs = 2463424728 parts: 3 (amp) 5 (amp) 2 (amp) 4 (amp) 
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 0 (amp) 5 (amp) 1 (amp) 4 (amp) 
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 4 (amp) 1 (amp) 0 (amp) 5 (amp) 
4: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2463424728 parts: 3 (amp) 0 (amp) 1 (amp) 2 (amp) 
5: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2463424728 parts: 1 (amp) 2 (amp) 3 (amp) 0 (amp) 
--------------- HBEAT ------------------ 
kfdpHbeat_dump: state=1, inst=1, ts=33107278.974040064, 
        rnd=3384698757.382747514.4006691341.2819570582.
kfk io-queue:    0x7ffff5d239d0
kfdpHbeatCB_dump: at 0x7ffff5d239c0 with ts=11/10/2020 14:14:32 iop=0x7ffff5d239d0, grp=2, disk=0/3915950070, isWrite=1 Hbeat #99 state=2 iostate=4
kfdpHbeatCB_dump: at 0x7ffff5d237f8 with ts=11/10/2020 14:14:32 iop=0x7ffff5d23808, grp=2, disk=2/3915950069, isWrite=1 Hbeat #99 state=2 iostate=4
kfdpHbeatCB_dump: at 0x7ffff5d23630 with ts=11/10/2020 14:14:32 iop=0x7ffff5d23640, grp=2, disk=4/3915950067, isWrite=1 Hbeat #99 state=2 iostate=4
InvalLck (group 2) upgraded to X
WARNING: GMON has insufficient disks to maintain consensus. Minimum required is 2: updating 2 PST copies from a total of 3.     -- the PST locations have changed: the count is 3 in the dump above and 2 in the dump below
InvalLck (group 2) downgraded to S
POST res = 1 
=============== PST ==================== 
grpNum:    2 
state:     1 
callCnt:   13 
(lockvalue) valid=1 ver=0.0 ndisks=2 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR -------------------- 
next:    3 
last:    3 
pst count:       2 
pst locations:   2  4 
incarn:          2 
dta size:        6 
version:         1 
ASM version:     186646528 = 11.2.0.0.0
contenttype:     1
partnering pattern:      [ ]
--------------- LOC MAP ---------------- 
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 0      stable_loc: 0
--------------- DTA -------------------- 
0: sts v v(-w) p(-w) a(-) d(-) fg# = 1 addTs = 2463424728 parts: 2 (amp) 4 (amp) 3 (amp) 5 (amp) 
1: sts v v(-w) p(-w) a(-) d(-) fg# = 1 addTs = 2463424728 parts: 3 (amp) 5 (amp) 2 (amp) 4 (amp) 
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 0 (amp) 5 (amp) 1 (amp) 4 (amp) 
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 4 (amp) 1 (amp) 0 (amp) 5 (amp) 
4: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2463424728 parts: 3 (amp) 0 (amp) 1 (amp) 2 (amp) 
5: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2463424728 parts: 1 (amp) 2 (amp) 3 (amp) 0 (amp)

Updating the data

SQL> update zhuo.test set object_name='wahaha' where rownum=1;

1 row updated.

SQL> commit;

Commit complete.

select  object_id,object_name,
dbms_rowid.rowid_relative_fno(rowid) rel_fno#,
dbms_rowid.rowid_block_number(rowid) block#
from zhuo.test where rownum=1;
 OBJECT_ID OBJECT_NAM   REL_FNO#     BLOCK#
---------- ---------- ---------- ----------
        20 wahaha              6        131

Manually corrupting disk 4

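The dd command is used here. In 12c, once AFD (ASM Filter Driver) is enabled, writes like this dd are filtered out automatically; see another post on this blog for details.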

[root@11gasm tmp]# dd if=/dev/zero of=/dev/asm_arch05 bs=4096 count=1 conv=notrunc
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00123799 s, 3.3 MB/s

Disk Group Name  Fail Group      Path                              File Name                 Status       Status  Status  TYPE      File Size (MB) Used Size (MB) Pct. Used
---------------- --------------- --------------------------------- ------------------------- ------------ ------- ------- --------- -------------- -------------- ---------
ARCHDG           FG1                                               ARCHDG_0000               UNKNOWN      MISSING OFFLINE REGULAR            3,072            552     17.97
                                                                   ARCHDG_0001               UNKNOWN      MISSING OFFLINE REGULAR            2,048            371     18.12
                 ***************                                                                                                    -------------- --------------
                 TOTAL                                                                                                                       5,120            923

                 FG2             /dev/asm_arch03                   ARCHDG_0002               MEMBER       CACHED  ONLINE  REGULAR            2,048            500     24.41
                                 /dev/asm_arch04                   ARCHDG_0003               MEMBER       CACHED  ONLINE  REGULAR            1,024            257     25.10
                 ***************                                                                                                    -------------- --------------
                 TOTAL                                                                                                                       3,072            757

                 FG3             /dev/asm_arch05                   ARCHDG_0004               CANDIDATE    CACHED  ONLINE  REGULAR            1,024            284     27.73
                                 /dev/asm_arch06                   ARCHDG_0005               MEMBER       CACHED  ONLINE  REGULAR            1,024            281     27.44
                 ***************                                                                                                    -------------- --------------
                 TOTAL

The ASM alert log now contains warnings:

Tue Nov 10 14:15:51 2020
WARNING: Disk 0 (ARCHDG_0000) in group 2 will be dropped in: (12960) secs on ASM inst 1
WARNING: Disk 1 (ARCHDG_0001) in group 2 will be dropped in: (12960) secs on ASM inst 1

The value 12960 here comes from the disk group attribute disk_repair_time.

[grid@11gasm ~]$ asmcmd lsattr -G datadg -l
Name                     Value       
access_control.enabled   FALSE       
access_control.umask     066         
au_size                  4194304     
cell.smart_scan_capable  FALSE       
compatible.asm           11.2.0.0.0  
compatible.rdbms         10.1.0.0.0  
disk_repair_time         3.6h        
sector_size              512

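3.6 * 3600 = 12960, which is exactly this value.
DISK_REPAIR_TIME specifies how long ASM keeps a disk in the offline state; once that time has elapsed, the disk is dropped. As with the fast mirror resync feature, the COMPATIBLE.ASM attribute must be set to 11.1 or higher. The attribute can only be changed with the ALTER DISKGROUP statement.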
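
As a quick check of the arithmetic, the sketch below converts a disk_repair_time value into the number of seconds reported in the alert log. The helper name repair_time_seconds is hypothetical, and treating a value without a unit suffix as hours is an assumption of this sketch.

```python
def repair_time_seconds(value):
    """Convert a disk_repair_time string such as '3.6h' or '30m' to seconds."""
    text = value.strip().lower()
    if text.endswith("h"):
        return round(float(text[:-1]) * 3600)
    if text.endswith("m"):
        return round(float(text[:-1]) * 60)
    # Assumption: a bare number is interpreted as hours.
    return round(float(text) * 3600)

if __name__ == "__main__":
    # The attribute value shown by asmcmd lsattr above.
    print(repair_time_seconds("3.6h"))
```

With the attribute value 3.6h from the lsattr output, the result is 12960, the same number that appears in the "will be dropped in: (12960) secs" warnings.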

The failure appears

SQL> alter diskgroup archdg online disks in failgroup fg1;

alter diskgroup archdg online disks in failgroup fg1
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15130: diskgroup "ARCHDG" is being dismounted
ORA-15335: ASM metadata corruption detected in disk group 'ARCHDG'
ORA-15130: diskgroup "ARCHDG" is being dismounted
ORA-15066: offlining disk "ARCHDG_0004" in group "ARCHDG" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483652] [0] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483652] [0] [0 != 1]

The disk group can no longer be mounted:

SQL> select name,state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
DATADG                         MOUNTED
ARCHDG                         DISMOUNTED


SQL> alter diskgroup archdg mount;
alter diskgroup archdg mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "2"
SQL> alter diskgroup archdg mount force;
alter diskgroup archdg mount force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "4" in group "ARCHDG" may result in a data loss
ORA-15042: ASM disk "4" is missing from group number "2"

Alert log:

SQL> alter diskgroup archdg mount 
NOTE: cache registered group ARCHDG number=2 incarn=0x9f185750
NOTE: cache began mount (first) of group ARCHDG number=2 incarn=0x9f185750
NOTE: Assigning number (2,5) to disk (/dev/asm_arch06)
NOTE: Assigning number (2,3) to disk (/dev/asm_arch04)
NOTE: Assigning number (2,2) to disk (/dev/asm_arch03)
NOTE: Assigning number (2,1) to disk (/dev/asm_arch02)
NOTE: Assigning number (2,0) to disk (/dev/asm_arch01)
Tue Nov 10 14:19:07 2020
NOTE: group ARCHDG: updated PST location: disk 0002 (PST copy 0)
NOTE: group ARCHDG: updated PST location: disk 0005 (PST copy 1)
NOTE: cache closing disk 0 of grp 2: (not open) ARCHDG_0000
NOTE: cache closing disk 1 of grp 2: (not open) ARCHDG_0001
NOTE: group ARCHDG: updated PST location: disk 0002 (PST copy 0)
NOTE: group ARCHDG: updated PST location: disk 0005 (PST copy 1)
NOTE: cache closing disk 0 of grp 2: (not open) ARCHDG_0000
NOTE: cache closing disk 1 of grp 2: (not open) ARCHDG_0001
NOTE: GMON heartbeating for grp 2
GMON querying group 2 at 22 for pid 18, osid 5251
NOTE: group ARCHDG: updated PST location: disk 0002 (PST copy 0)
NOTE: group ARCHDG: updated PST location: disk 0005 (PST copy 1)
NOTE: cache closing disk 0 of grp 2: (not open) ARCHDG_0000
NOTE: cache closing disk 1 of grp 2: (not open) ARCHDG_0001
NOTE: Assigning number (2,4) to disk ()
GMON querying group 2 at 23 for pid 18, osid 5251
NOTE: group ARCHDG: updated PST location: disk 0002 (PST copy 0)    --- the PST locations have changed
NOTE: group ARCHDG: updated PST location: disk 0005 (PST copy 1)    --- this copy is now on disk 5 instead of disk 4
NOTE: cache closing disk 0 of grp 2: (not open) ARCHDG_0000
NOTE: cache closing disk 1 of grp 2: (not open) ARCHDG_0001
NOTE: cache dismounting (clean) group 2/0x9F185750 (ARCHDG) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 5251, image: oracle@11gasm (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 2/0x9F185750 (ARCHDG) 
NOTE: cache ending mount (fail) of group ARCHDG number=2 incarn=0x9f185750
NOTE: cache deleting context for group ARCHDG 2/0x9f185750
GMON dismounting group 2 at 24 for pid 18, osid 5251
NOTE: Disk ARCHDG_0000 in mode 0x1 marked for de-assignment
NOTE: Disk ARCHDG_0001 in mode 0x1 marked for de-assignment
NOTE: Disk ARCHDG_0002 in mode 0x7f marked for de-assignment
NOTE: Disk ARCHDG_0003 in mode 0x7f marked for de-assignment
NOTE: Disk  in mode 0x7f marked for de-assignment
NOTE: Disk ARCHDG_0005 in mode 0x7f marked for de-assignment
ERROR: diskgroup ARCHDG was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "2" 
ERROR: alter diskgroup archdg mount

The gmon trace shows how the PST copies are distributed at this point:

*** 2020-11-10 14:19:48.213
Mode of disk 0 (ARCHDG_0000) in grp 2 changing from 0x9 to 0x1     ---- the disk modes have been changed
Mode of disk 1 (ARCHDG_0001) in grp 2 changing from 0x9 to 0x1
Mode of disk 2 (ARCHDG_0002) in grp 2 changing from 0x9 to 0x7f
Mode of disk 3 (ARCHDG_0003) in grp 2 changing from 0x9 to 0x7f
Mode of disk 5 (ARCHDG_0005) in grp 2 changing from 0x9 to 0x7f
POST res = 12 
=============== PST ==================== 
grpNum:    2 
state:     1 
callCnt:   25 
(lockvalue) valid=1 ver=0.0 ndisks=2 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR -------------------- 
next:    8 
last:    8 
pst count:       2 
pst locations:   2  5     -- the PST locations have changed
incarn:          7 
dta size:        6 
version:         1 
ASM version:     186646528 = 11.2.0.0.0
contenttype:     1
partnering pattern:      [ ]
--------------- LOC MAP ---------------- 
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 1      stable_loc: 1
--------------- DTA -------------------- 
0: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2463424728 parts: 2 (amp) 4 (amp) 3 (amp) 5 (amp) 
1: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2463424728 parts: 3 (amp) 5 (amp) 2 (amp) 4 (amp) 
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 0 (amp) 5 (amp) 1 (amp) 4 (amp) 
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 4 (amp) 1 (amp) 0 (amp) 5 (amp) 
4: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2463424728 parts: 3 (amp) 0 (amp) 1 (amp) 2 (amp) 
5: sts v v(rw) p(rw) a(x) d(x) fg# = 3 addTs = 2463424728 parts: 1 (amp) 2 (amp) 3 (amp) 0 (amp)
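
The DTA section of such a trace can be scanned for disks that the PST does not consider fully accessible. The sketch below is a hypothetical helper, not an Oracle utility. It assumes the DTA line layout shown in the traces above and treats any entry whose first flag group is not v(rw) as not fully accessible, which is this sketch's interpretation rather than documented behavior.

```python
import re

# Matches a DTA line such as:
#   0: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2463424728 parts: ...
DTA_RE = re.compile(
    r"^\s*(\d+): sts v v\((..)\) p\((..)\) a\((.)\) d\((.)\) fg# = (\d+)"
)

def not_fully_accessible(trace_text):
    """Return (disk number, failgroup number) for DTA entries not marked v(rw)."""
    result = []
    for line in trace_text.splitlines():
        match = DTA_RE.match(line)
        if match and match.group(2) != "rw":
            result.append((int(match.group(1)), int(match.group(6))))
    return result

# Three DTA lines copied from the trace above.
sample_trace = """\
0: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2463424728 parts: 2 (amp) 4 (amp) 3 (amp) 5 (amp)
1: sts v v(--) p(--) a(-) d(-) fg# = 1 addTs = 2463424728 parts: 3 (amp) 5 (amp) 2 (amp) 4 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2463424728 parts: 0 (amp) 5 (amp) 1 (amp) 4 (amp)
"""

if __name__ == "__main__":
    print(not_fully_accessible(sample_trace))
```

For the lines above the helper reports disks 0 and 1, both in failgroup 1, which are the two FG1 disks that were taken offline.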

Root cause

Even after the failgroup has been brought back online, the PST still records its disks as not normally accessible.

Solution

Checking the disk status in the PST

[grid@11gasm ~]$ kfed read /dev/asm_arch03 aun=1 blkn=2|grep status|grep -v "I=0"
kfdpDtaEv1[0].status:                 1 ; 0x000: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[1].status:                 1 ; 0x030: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[2].status:               127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:               127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:               127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status:               127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@11gasm ~]$ kfed read /dev/asm_arch03 aun=1 blkn=3|grep status|grep -v "I=0" 
kfdpDtaEv1[0].status:                 1 ; 0x000: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[1].status:                 1 ; 0x030: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[2].status:               127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:               127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:               127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status:               127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@11gasm ~]$ kfed read /dev/asm_arch03 aun=1 blkn=4|grep status|grep -v "I=0" 
[grid@11gasm ~]$ kfed read /dev/asm_arch03 aun=1 blkn=5|grep status|grep -v "I=0" 
[grid@11gasm ~]$ kfed read /dev/asm_arch06 aun=1 blkn=1|grep status|grep -v "I=0"  
[grid@11gasm ~]$ kfed read /dev/asm_arch06 aun=1 blkn=2|grep status|grep -v "I=0" 
kfdpDtaEv1[0].status:                 1 ; 0x000: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[1].status:                 1 ; 0x030: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[2].status:               127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:               127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:               127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status:               127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@11gasm ~]$ kfed read /dev/asm_arch06 aun=1 blkn=3|grep status|grep -v "I=0" 
kfdpDtaEv1[0].status:                 1 ; 0x000: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[1].status:                 1 ; 0x030: I=1 V=0 V=0 P=0 P=0 A=0 D=0
kfdpDtaEv1[2].status:               127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:               127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:               127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status:               127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
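
The status values can be decoded bit by bit. The sketch below assumes that bit 0 is the I flag and that the remaining bits follow the order in which kfed prints them (V V P P A D). Only the values 1 and 127 appear in the kfed output, so the order of the middle bits is an assumption; it is at least consistent with the alert log, where clearing mask 0x6a leaves v(-w) p(-w) and clearing mask 0x7e leaves mode 0x1. The function names are hypothetical.

```python
# Flag names in the order kfed prints them; bit 0 is assumed to be the first one.
FLAG_NAMES = ["I", "V", "V", "P", "P", "A", "D"]

def format_status(status):
    """Render a PST status value in the same style as the kfed comment."""
    return " ".join(
        f"{name}={(status >> bit) & 1}" for bit, name in enumerate(FLAG_NAMES)
    )

def clear_mask(status, mask):
    """Apply an 'op = clear' PST update: clear every bit that is set in mask."""
    return status & ~mask

if __name__ == "__main__":
    print(format_status(127))                      # fully online disk
    print(format_status(clear_mask(0x7F, 0x6A)))   # after the first offline step
    print(format_status(clear_mask(0x7F, 0x7E)))   # after the second offline step
```

Under these assumptions, status 1 means that only the I flag is still set, which is the state recorded for entries 0 and 1 in the kfed output above.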

Modifying the disk status

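In the PST copies held on disks 2 and 5 (/dev/asm_arch03 and /dev/asm_arch06), every status value needs to be 127; in practice this means changing entries 0 and 1 from 1 to 127. This corresponds to the disk states seen in the earlier gmon trace. The PST table blocks are dumped to text files with kfed read, the status values are edited, and the blocks are written back with kfed merge (the merge commands themselves are not shown in this post). The target state is: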

kfdpDtaEv1[0].status:               127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status:               127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status:               127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:               127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:               127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status:               127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1

[grid@11gasm ~]$  kfed read /dev/asm_arch03 aun=1 blkn=2>/tmp/0302.txt
[grid@11gasm ~]$  kfed read /dev/asm_arch03 aun=1 blkn=3>/tmp/0303.txt
[grid@11gasm ~]$ kfed read /dev/asm_arch06 aun=1 blkn=2>/tmp/0602.txt
[grid@11gasm ~]$ kfed read /dev/asm_arch06 aun=1 blkn=3>/tmp/0603.txt

Mounting the disk group

SQL> alter diskgroup archdg mount;
alter diskgroup archdg mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "2"


SQL> alter diskgroup archdg mount force;

Diskgroup altered.

Verifying the data

select  object_id,object_name,
dbms_rowid.rowid_relative_fno(rowid) rel_fno#,
dbms_rowid.rowid_block_number(rowid) block#
from zhuo.test where rownum=1;
 OBJECT_ID OBJECT_NAM   REL_FNO#     BLOCK#
---------- ---------- ---------- ----------
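        20 ICOL$               6        131

The query returns the original value ICOL$ rather than wahaha, so the update that was committed while fg1 was offline is not visible after this recovery.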

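This example relies on the following background:
1. The PST may be relocated in the following situations:
  - An ASM disk that holds a PST copy is unavailable (when ASM starts up)
  - An ASM disk is taken offline (the case in this example)
  - An I/O error occurs while reading or writing the PST
  - A disk is dropped normally
2. The PST is checked before any other ASM metadata is read:
  - When an ASM instance is asked to mount a disk group, the GMON process reads all disks in the disk group to find and verify the PST copies
  - If it finds enough PST copies, it mounts the disk group
  - After that, the PST is cached in the ASM cache and in the GMON PGA, protected by an exclusive PT.n.0 lock
  - Other ASM instances in the same cluster also cache the PST in their GMON PGA, protected by a shared PT.n.0 lock
  - Only the GMON that holds the exclusive lock can update the PST on disk
  - AU 1 (AUN=1) on every ASM disk is reserved for the PST, but only a few disks actually hold PST data
3. The PST is stored in AU 1, but not on every disk. PST stands for Partner and Status Table. It consists of two parts, the PST header and the PST table blocks; this example works with the PST table blocks. Each record in the PST corresponds to one ASM disk in the disk group and lists the partner disks of that disk. Each record also carries a flag that indicates whether the disk is online and readable and writable. This information bears on whether recovery can be performed.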

Some important fields:
kfdpDtaEv1[0].status: 127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1    -- disk status
fgNum    -- failgroup number
addTs    -- timestamp of the disk's addition to the disk group
kfdpDtaEv1[0].partner[0]: 49154 ; 0x008: P=1 P=1 PART=0x2    -- partner list

This example also shows that every PST table block stores the same content, which is why a single modified file was used for all of the merge operations above.
