Resolving an ASM Environment Failure: A Problem Triggered by Insufficient ASM Space


The RAC test environment was running out of space, and a failure occurred while I was adding new disks to ASM.

 

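The steps were roughly as follows. I started dbca on node 1 to manage the ASM devices. Some of the configured raw devices were not visible in the ASM graphical interface, so on node 1 I used the root user to grant the oracle user access to those raw devices.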

After that, the raw devices appeared among the candidate disks in the graphical interface, and I added them to the disk group from there.

However, the operation raised two errors: ORA-15032 and ORA-15075.

 

ORA-15032: not all alterations performed

Cause: At least one ALTER DISKGROUP action failed.

Action: Check the other messages issued along with this summary error.

 

ORA-15075: disk(s) are not visible cluster-wide

Cause: An ALTER DISKGROUP ADD DISK command specified a disk that could not be discovered by one or more nodes in a RAC cluster configuration.

Action: Determine which disks are causing the problem from the GV$OSM_DISK fixed view. Check operating system permissions for the device and the storage sub-system configuration on each node in a RAC cluster that cannot identify the disk.

 

The ORA-15075 message is actually clear enough. With some experience, or simply by analyzing this error, the cause of the problem can be found.

But another unexpected event sent the troubleshooting off in a different direction.

One odd thing was that, although I believed the operation had failed, the raw devices were already visible in dbca's ASM configuration.

While I was still looking into these two errors, a colleague told me that the instance on node 2 could no longer be reached.

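Checking with operating system commands showed that database instance 2 had shut down, while the ASM instance on node 2 was still running. This seemed strange: the error came from an ASM operation and the ASM instance itself had not failed, so why had the database instance gone down?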

I checked the alert log and tried a restart to see what errors came up:

$ tail -500 alert*
List of nodes:
.
.
.
Thu Mar 29 17:10:24 2007
SUCCESS: disk DISK_0012 (12.4042303515) added to diskgroup DISK
SUCCESS: disk DISK_0013 (13.4042303516) added to diskgroup DISK
SUCCESS: disk DISK_0014 (14.4042303517) added to diskgroup DISK
SUCCESS: disk DISK_0015 (15.4042303518) added to diskgroup DISK
SUCCESS: disk DISK_0016 (16.4042303519) added to diskgroup DISK
Thu Mar 29 17:25:36 2007
SUCCESS: disk DISK_0017 (17.4042303525) added to diskgroup DISK
SUCCESS: disk DISK_0018 (18.4042303520) added to diskgroup DISK
SUCCESS: disk DISK_0019 (19.4042303521) added to diskgroup DISK
SUCCESS: disk DISK_0020 (20.4042303522) added to diskgroup DISK
SUCCESS: disk DISK_0021 (21.4042303523) added to diskgroup DISK
SUCCESS: disk DISK_0022 (22.4042303524) added to diskgroup DISK
Thu Mar 29 17:29:45 2007
SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading (block 35, # blocks 1) of control file
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
LMON: terminating instance due to error 204
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_pmon_2754.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
System state dump is made for local instance
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms1_2797.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lms0_2793.trc:
ORA-00204: error in reading (block , # blocks ) of control file
System State dumped to trace file /data/oracle/admin/testrac/bdump/testrac2_diag_2756.trc
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmd0_2791.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_psp0_2778.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:36:07 2007

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> shutdown
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory

In fact, the alert log already clearly contains the cause of the error:

SUCCESS: diskgroup DISK was dismounted
SUCCESS: diskgroup DISK was dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Thu Mar 29 17:29:46 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_2789.trc:
ORA-00204: error in reading (block 35, # blocks 1) of control file
ORA-00202: control file: '+DISK/testrac/control01.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted

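The ASM disk group had been dismounted. Because I was not familiar with ASM, I paid little attention to the ASM messages and only noticed the entries that followed: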

Errors in file /data/oracle/admin/testrac/bdump/testrac2_j001_677.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_j000_3675.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:47 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac2_rbal_2982.trc:
ORA-00204: error in reading (block , # blocks ) of control file
Thu Mar 29 17:29:52 2007
Instance terminated by LMON, pid = 2789

I took these to be the cause of the problem.
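The misreading above came from looking at the last errors in the log rather than the earliest ASM-related ones. As a minimal sketch, the hypothetical helper below (the name `asm_events` is not from the original session) pulls out only the diskgroup dismount messages and the errors in the ORA-15xxx range, which is where the ASM errors quoted in this post fall, and prefixes each with the timestamp line that precedes it in the alert log.

```shell
# Hypothetical helper: list ASM-related lines from an alert log, each
# prefixed with the most recent timestamp line seen before it.
asm_events() {
  awk '
    # Timestamp lines look like: Thu Mar 29 17:29:46 2007
    /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] / { ts = $0; next }
    # ORA-15xxx errors and diskgroup dismount messages
    /ORA-15[0-9][0-9][0-9]/ || /diskgroup .* dismounted/ { print ts " | " $0 }
  ' "$1"
}
```

It could be run as `asm_events alert_testrac2.log | head`, where the file name is an assumption. On the log above, the dismount at 17:29:45 would then appear before any of the control file errors.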

The startup messages that followed also point to the problem:

ORA-15077: could not locate ASM instance serving a required diskgroup

ORA-15077: could not locate ASM instance serving a required diskgroup

Cause: The instance failed to perform the specified operation because it could not locate a required ASM instance.

Action: Start an ASM instance and mount the required diskgroup.

 

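But I had recently run into a bug whose key error message also happened to be ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora, so for the moment I overlooked the key information again. A detailed description of that bug is at: http://yangtingkun.itpub.net/post/468/272289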

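My thinking naturally turned to that bug, and I assumed this problem might be related to the earlier one. I tried starting the database with a local pfile: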

SQL> startup pfile=/export/home/oracle/inittestrac2.ora
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 503317800 bytes
Database Buffers 1627389952 bytes
Redo Buffers 14745600 bytes
ORA-00205: error in identifying control file, check alert log for more info

Misled once more, I went off to look up the ORA-00205 error.

 

ORA-00205: error in identifying control file, check alert log for more info

Cause: The system could not find a control file of the specified name and size.

Action: Check that ALL control files are online and that they are the same files that the system created at cold start time.

 

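Only when I realized that the control file itself was fine, since instance 1 had been running normally the whole time, did I see that I had gone down the wrong path.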

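I carefully went back over all the error messages, together with the operation that had triggered them, namely adding disks to the disk group, and finally found the real problem.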

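When granting access, I had changed the permissions on the raw devices only on node 1, not on node 2. So although the ASM instance on node 1, configured through dbca, could add the raw devices to the disk group, the same operation on node 2 failed for lack of permissions. That caused the disk group to be dismounted on node 2, which in turn brought the database instance down.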
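A check like the following, run on every node before adding disks, would have exposed the mismatch. This is only a sketch: the function name `wrong_owner` is hypothetical, and it reads the owner from `ls -ldL` so that symbolic links under /dev are followed to the real device node.

```shell
# Hypothetical helper: print every device whose owner is not the expected user.
# Usage: wrong_owner oracle /dev/rdsk/...s1 /dev/rdsk/...s3
wrong_owner() {
  expected=$1; shift
  for dev in "$@"; do
    # -L follows symlinks, -d avoids listing directory contents
    owner=$(ls -ldL "$dev" 2>/dev/null | awk '{print $3}')
    [ "$owner" = "$expected" ] || echo "$dev owner=${owner:-unknown}"
  done
}
```

Empty output means every listed device belongs to the expected user. Running the same call on each node, for example over ssh (an assumption about how the nodes are reached), should give empty output everywhere before the disks are added to the disk group.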

So I granted access to the raw devices on node 2 and restarted the ASM instance, which resolved the problem.

$ su -
Password:
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s1
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s3
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s4
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s5
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s6
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad6s7
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s1
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s3
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s4
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s5
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s6
# chown oracle:oinstall /dev/rdsk/c2t500601603022E66Ad7s7
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:52:38 2007

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> shutdown
ORA-01507: database not mounted


ORACLE instance shut down.
SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DISK/testrac/spfiletestrac.ora'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/spfiletestrac.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
$ ps -ef|grep ASM
oracle 1993 1 0 Mar 28 ? 0:00 asm_mman_+ASM2
oracle 1979 1 0 Mar 28 ? 0:00 asm_pmon_+ASM2
oracle 1987 1 0 Mar 28 ? 0:18 asm_lmd0_+ASM2
oracle 2658 1 0 Mar 28 ? 0:00 asm_o000_+ASM2
oracle 1983 1 0 Mar 28 ? 0:00 asm_psp0_+ASM2
oracle 2332 1 0 Mar 28 ? 0:01 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode2.ASM2.asm
oracle 1981 1 0 Mar 28 ? 0:00 asm_diag_+ASM2
oracle 1985 1 0 Mar 28 ? 0:01 asm_lmon_+ASM2
oracle 1989 1 0 Mar 28 ? 0:01 asm_lms0_+ASM2
oracle 2028 1 0 Mar 28 ? 0:04 asm_ckpt_+ASM2
oracle 2026 1 0 Mar 28 ? 0:00 asm_lgwr_+ASM2
oracle 2008 1 0 Mar 28 ? 0:01 asm_dbw0_+ASM2
oracle 2030 1 0 Mar 28 ? 0:00 asm_smon_+ASM2
oracle 2032 1 0 Mar 28 ? 0:00 asm_rbal_+ASM2
oracle 2034 1 0 Mar 28 ? 0:00 asm_gmon_+ASM2
oracle 2065 1 0 Mar 28 ? 0:01 asm_lck0_+ASM2
oracle 23532 20734 0 17:54:05 pts/1 0:00 grep ASM
oracle 15238 1 0 17:29:43 ? 0:00 asm_b000_+ASM2
$ srvctl stop asm -n racnode2
$ srvctl start asm -n racnode2
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.2.0 - Production on Thu Mar 29 17:55:17 2007

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size 2030296 bytes
Variable Size 469763368 bytes
Database Buffers 1660944384 bytes
Redo Buffers 14745600 bytes
Database mounted.
Database opened.
SQL>
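The twelve chown commands in the session above differ only in disk and slice number, so they could be generated with a loop, which makes it easier to repeat exactly the same change on every node. The sketch below only prints the commands (a dry run); the function name `print_chown_cmds` is hypothetical, while the device path, disks, and slices are copied from the session.

```shell
# Hypothetical helper: print (dry run) one chown command per slice of each disk.
# Base path, disk names, and slice numbers are copied from the session above.
print_chown_cmds() {
  base=/dev/rdsk/c2t500601603022E66A
  for disk in d6 d7; do
    for slice in 1 3 4 5 6 7; do
      echo "chown oracle:oinstall ${base}${disk}s${slice}"
    done
  done
}
```

After reviewing the output, it could be piped to `sh` as root on each node, for example `print_chown_cmds | sh`.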

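With that, the problem was solved. The cause turned out to be simple, but when a problem occurs it takes calm analysis and judgment; otherwise it is easy to be distracted by unrelated information and take a lot of detours.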
