This post summarizes a RAC failure I once experienced, in the hope that it prompts some thought and review of your own.
1. Symptoms
The symptom was unmistakable: no matter what we tried, none of the RAC services would start.
While attempting to start the database instance, the alert log recorded the following.
Wed Feb 9 09:32:33 2011
This instance was first to mount
Wed Feb 9 09:32:33 2011
ORA-00202: control file: '+ORADATA/racdb/controlfile/current.256.668238016'
ORA-17503: ksfdopn:2 Failed to open file +ORADATA/racdb/controlfile/current.256.668238016
ORA-15001: diskgroup "ORADATA" does not exist or is not mounted
ORA-15077: could not locate ASM instance serving a required diskgroup
ORA-205 signalled during: ALTER DATABASE MOUNT...
Wed Feb 9 09:35:13 2011
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
0 1
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
The log shows that the ORADATA disk group was not mounted, so the corresponding control file could not be opened.
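When faced with a startup failure like this, a quick first triage step is to pull the unique ORA- error codes out of the alert log before reading it in full. A minimal sketch; the temp file below reproduces the errors above for demonstration, and in practice you would point it at the real alert log:

```shell
# Extract the unique ORA- error codes from an alert log excerpt.
# The file written here is sample data; substitute the real alert log path.
alert=$(mktemp)
cat > "$alert" <<'EOF'
ORA-00202: control file: '+ORADATA/racdb/controlfile/current.256.668238016'
ORA-17503: ksfdopn:2 Failed to open file +ORADATA/racdb/controlfile/current.256.668238016
ORA-15001: diskgroup "ORADATA" does not exist or is not mounted
ORA-15077: could not locate ASM instance serving a required diskgroup
EOF
grep -o 'ORA-[0-9]*' "$alert" | sort -u
```

Here the sorted list (ORA-00202, ORA-15001, ORA-15077, ORA-17503) immediately points past the control file symptom to the ASM disk group as the thing to investigate.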
2. Analysis
From the alert log alone it is difficult to conclude that a storage array failure caused this database problem.
We traced the fault further back to its source.
1) Check the raw-device configuration file
cat /etc/udev/rules.d/60-raw.rules
# block devices with O_DIRECT.
#
# Enter raw device bindings here.
#
# An example would be:
# ACTION=="add", KERNEL=="sda", RUN+="/bin/raw /dev/raw/raw1 %N"
# to bind /dev/raw/raw1 to /dev/sda, or
# ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="1", RUN+="/bin/raw /dev/raw/raw2 %M %m"
# to bind /dev/raw/raw2 to the device with major 8, minor 1.
ACTION=="add", KERNEL=="/dev/sda1", RUN+="/bin/raw /dev/raw/raw1 %N"
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="1", RUN+="/bin/raw /dev/raw/raw1 %M %m"
ACTION=="add", KERNEL=="/dev/sdb1", RUN+="/bin/raw /dev/raw/raw2 %N"
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="17", RUN+="/bin/raw /dev/raw/raw2 %M %m"
ACTION=="add", KERNEL=="/dev/sdc1", RUN+="/bin/raw /dev/raw/raw3 %N"
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="33", RUN+="/bin/raw /dev/raw/raw3 %M %m"
ACTION=="add", KERNEL=="/dev/sdd1", RUN+="/bin/raw /dev/raw/raw4 %N"
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="49", RUN+="/bin/raw /dev/raw/raw4 %M %m"
ACTION=="add", KERNEL=="/dev/sdc2", RUN+="/bin/raw /dev/raw/raw5 %N"
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="34", RUN+="/bin/raw /dev/raw/raw5 %M %m"
KERNEL=="raw1", OWNER="oracle", GROUP="oinstall", MODE="0600"
KERNEL=="raw2", OWNER="oracle", GROUP="oinstall", MODE="0600"
KERNEL=="raw3", OWNER="oracle", GROUP="oinstall", MODE="0600"
KERNEL=="raw4", OWNER="oracle", GROUP="oinstall", MODE="0600"
KERNEL=="raw5", OWNER="oracle", GROUP="oinstall", MODE="0600"
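One detail worth noting: udev's KERNEL key matches the kernel device name without the /dev/ prefix, so the KERNEL=="/dev/sdX1" rules above never match anything; it is the ENV{MAJOR}/ENV{MINOR} rules that actually create the raw bindings. The major:minor-to-raw-device mapping encoded in those rules can be extracted for a quick sanity check. A sketch, parsing a sample copy of the file (substitute the real /etc/udev/rules.d/60-raw.rules in practice):

```shell
# Print the major:minor -> raw-bind-command mapping encoded in the rules file.
# A two-rule sample is written to a temp file so the parse is reproducible.
rules=$(mktemp)
cat > "$rules" <<'EOF'
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="17", RUN+="/bin/raw /dev/raw/raw2 %M %m"
ACTION=="add", ENV{MAJOR}=="8", ENV{MINOR}=="49", RUN+="/bin/raw /dev/raw/raw4 %M %m"
EOF
# Split on double quotes: field 4 is MAJOR, field 6 is MINOR, field 8 is the command.
awk -F'"' '/MAJOR/ {print $4 ":" $6 " -> " $8}' "$rules"
```

For the sample above this prints `8:17 -> /bin/raw /dev/raw/raw2 %M %m` and `8:49 -> /bin/raw /dev/raw/raw4 %M %m`, which can then be compared against the major/minor numbers shown by `ls -l /dev`.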
2) Check the corresponding device entries in the operating system
racdb1@racdb1 /dev$ ls -l | grep sd
brw-r----- 1 root disk 8, 0 2011-02-09 sda
brw-r----- 1 root disk 8, 1 02-09 09:26 sda1
brw-r----- 1 root disk 8, 2 02-09 09:26 sda2
brw-r----- 1 root disk 8, 16 2011-02-09 sdb
brw-r----- 1 root disk 8, 17 02-09 09:26 sdb1
brw-r----- 1 root disk 8, 18 02-09 09:26 sdb2
brw-r----- 1 root disk 8, 32 2011-02-09 sdc
brw-r----- 1 root disk 8, 33 02-09 09:26 sdc1
brw-r----- 1 root disk 8, 34 02-09 09:26 sdc2
The root cause now surfaces: the system had not recognized any sdd devices, and those sdd devices are precisely the resources the ORADATA disk group requires.
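This kind of gap is easy to spot mechanically by diffing the partitions the udev rules expect against what the kernel actually registered. A minimal sketch, using the failed state captured above as sample data (a live check would populate `present` from `ls /dev/sd*` instead):

```shell
# Report devices that the raw-device rules expect but the kernel has not registered.
# $present simulates the `ls -l /dev | grep sd` result during the failure.
expected="sda1 sdb1 sdc1 sdc2 sdd1"
present="sda sda1 sda2 sdb sdb1 sdb2 sdc sdc1 sdc2"
for d in $expected; do
  case " $present " in
    *" $d "*) : ;;              # device present, nothing to report
    *) echo "MISSING: $d" ;;    # device expected by the rules but absent
  esac
done
```

Run against the failed state this prints `MISSING: sdd1`, matching the diagnosis above.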
3) Check the disks required by RAC
[root@racdb2 dev]# /etc/init.d/oracleasm listdisks
[root@racdb2 dev]#
The empty result shows that oracleasm sees no available ASM disks.
3. Resolution
With the fault located, the fix could be applied with precision.
The problem was ultimately confirmed to be caused by a power failure in the storage array.
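On Linux, once the array is powered back up, missing LUNs can often be re-registered without a reboot by rescanning the SCSI host adapters. The sketch below is a dry run that only prints the commands; the real operation writes into /sys as root, and the host numbers here are assumptions to be replaced with the entries actually listed under /sys/class/scsi_host/:

```shell
# Dry run: print the SCSI rescan commands for two assumed host adapters.
# Executing them for real requires root; "- - -" means rescan all
# channels, targets, and LUNs on that host.
for host in host0 host1; do
  echo "echo '- - -' > /sys/class/scsi_host/$host/scan"
done
```

After the rescan (or a reboot), the device nodes should reappear, as verified next.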
After the storage was restored, we checked the device information again.
racdb1@racdb1 /dev$ ls -l | grep sd
brw-r----- 1 root disk 8, 0 2011-02-09 sda
brw-r----- 1 root disk 8, 1 02-09 11:22 sda1
brw-r----- 1 root disk 8, 2 02-09 11:22 sda2
brw-r----- 1 root disk 8, 16 2011-02-09 sdb
brw-r----- 1 root disk 8, 17 02-09 11:22 sdb1
brw-r----- 1 root disk 8, 18 02-09 11:22 sdb2
brw-r----- 1 root disk 8, 32 2011-02-09 sdc
brw-r----- 1 root disk 8, 33 02-09 11:22 sdc1
brw-r----- 1 root disk 8, 34 02-09 11:22 sdc2
brw-r----- 1 root disk 8, 48 2011-02-09 sdd
brw-r----- 1 root disk 8, 49 02-09 11:22 sdd1
brw-r----- 1 root disk 8, 64 2011-02-09 sde
brw-r----- 1 root disk 8, 65 02-09 11:22 sde1
brw-r----- 1 root disk 8, 66 02-09 11:22 sde2
brw-r----- 1 root disk 8, 67 02-09 11:22 sde3
As shown, every device on the storage array has been recognized again.
racdb1@racdb1 /dev/raw$ ls -tlr
crw------- 1 oracle oinstall 162, 3 02-09 11:22 raw3
crw------- 1 oracle oinstall 162, 5 02-09 11:22 raw5
crw------- 1 oracle oinstall 162, 1 02-09 11:31 raw1
crw------- 1 oracle oinstall 162, 4 02-09 11:31 raw4
crw------- 1 oracle oinstall 162, 2 02-09 11:31 raw2
The fault was thus resolved satisfactorily.
4. Summary
RAC is arguably one of Oracle's finest achievements in high availability, but maintaining RAC demands extra care: a slight oversight in some minor detail can have serious consequences.
Let this article serve as a reminder to fellow RAC DBAs: regular, comprehensive health checks are not a formality but a prerequisite for keeping the database running efficiently and stably.
Good luck.
secooler
11.03.29
-- The End --
Source: ITPUB blog, http://blog.itpub.net/519536/viewspace-691262/