一个客户的11.2 RAC for Linux X86-64环境停电后出现了故障,RAC环境无法自动启动。
这一篇数据库无法启动的问题。
11.2 RAC自动启动报错诊断:http://yangtingkun.itpub.net/post/468/518656
11.2 RAC自动启动报错诊断(二):http://yangtingkun.itpub.net/post/468/518805
本以为帮客户定位问题后,剩下就是硬件工程师的工作了,没有想到这时由于I/O问题,导致数据库实例又一次DOWN掉,且无法启动。
[oracle@Oracle-01 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.2.0 Production on Wed Jun 1 17:19:13 2011
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 2.0176E+10 bytes
Fixed Size 2235176 bytes
Variable Size 1.0201E+10 bytes
Database Buffers 9932111872 bytes
Redo Buffers 41492480 bytes
Database mounted.
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 20694 change 7321678 time 06/01/2011 16:57:46
ORA-00312: online log 3 thread 2: '+DG_REDO/spsp/onlinelog/group_3.259.751045771'
由于CURRENT REDO LOG损坏,常规方法已经无法打开数据库了,只有使用隐含参数,创建PFILE后,添加下列隐含参数:
*._allow_resetlogs_corruption=true
*._allow_error_simulation=true
再次尝试启动:
[oracle@Oracle-01 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.2.0 Production on Wed Jun 1 17:37:01 2011
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to an idle instance.
SQL> shutdown abort
ORACLE instance shut down.
SQL> startup mount pfile=initsp.ora
ORACLE instance started.
Total System Global Area 2.0176E+10 bytes
Fixed Size 2235176 bytes
Variable Size 1.0201E+10 bytes
Database Buffers 9932111872 bytes
Redo Buffers 41492480 bytes
ORA-00205: error in identifying control file, check alert log for more info
这时,控制文件又出现了错误,检查alert文件,发现原来整个ASM磁盘组都OFFLINE了:
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
NOTE: Attempting voting file refresh on diskgroup DG_CRS
Wed Jun 01 17:45:12 2011
WARNING: failed to online diskgroup resource ora.DG_ARCH.dg (unable to communicate with CRSD/OHASD)
Wed Jun 01 17:45:13 2011
NOTE: [crsd.bin@Oracle-01 (TNS V1-V3) 13776] opening OCR file
Starting background process ASMB
Wed Jun 01 17:45:13 2011
ASMB started with pid=25, OS id=13788
Wed Jun 01 17:45:13 2011
NOTE: client +ASM1:+ASM registered, osid 13790, mbr 0x0
WARNING: failed to online diskgroup resource ora.DG_CRS.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DG_DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DG_FLSH.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DG_REDO.dg (unable to communicate with CRSD/OHASD)
Wed Jun 01 17:46:18 2011
显然又是I/O问题导致了ASM磁盘组自动OFFLINE,果然没用多久,当前节点被CLUSTER踢出,自动重启。
自动重启后,ASM实例自动启动,但是数据库实例由于日志文件损坏,而无法自动启动,尝试加载隐含参数后RESETLOG日志:
SQL> startup mount pfile=/home/oracle/initsp.ora
ORACLE instance started.
Total System Global Area 2.0176E+10 bytes
Fixed Size 2235176 bytes
Variable Size 1.0201E+10 bytes
Database Buffers 9932111872 bytes
Redo Buffers 41492480 bytes
Database mounted.
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01139: RESETLOGS option only valid after an incomplete database recovery
SQL> recover database until cancel
ORA-00279: change 7291882 generated at 06/01/2011 16:39:56 needed for thread 2
ORA-00289: suggestion : +DG_ARCH
ORA-00280: change 7291882 for thread 2 is in sequence #321
Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
ORA-10879: error signaled in parallel recovery slave
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG_DATA/spsp/datafile/system.256.751041975'
SQL> alter database open resetlogs;
数据库一直处于等待RESETLOGS的状态,很长时间都没用结束。
检查alert告警日志,看看数据库有没有什么异常信息:
Thread 1 opened at log sequence 1
Current log# 1 seq# 1 mem# 0: +DG_REDO/spsp/onlinelog/group_1.257.752695815
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Jun 01 18:10:16 2011
SMON: enabling cache recovery
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc (incident=250279):
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/spsp/SPSP1/incident/incdir_250279/SPSP1_ora_19044_i250279.trc
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Redo thread 2 internally disabled at seq 1 (CKPT)
ARC2: Archiving disabled thread 2 sequence 1
Archived Log entry 785 added for thread 2 sequence 1 ID 0x0 dest 1:
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 19044): terminating the instance due to error 600
Instance terminated by USER, pid = 19044
ORA-1092 signalled during: alter database open resetlogs...
Wed Jun 01 18:12:13 2011
ORA-1092 : opitsk aborting process
Wed Jun 01 18:12:18 2011
ORA-1092 : opitsk aborting process
Wed Jun 01 18:12:29 2011
ORA-1092 : opitsk aborting process
虽然前台的命令一直没有结束,但是后台已经出现了ORA-600 [2662]的错误了,看来前台的操作报错也只是时间问题。
对于2662错误,可以通过EVENTS调整SCN的方式解决:
SQL> ALTER SESSION SET EVENTS '10015 TRACE NAME ADJUST_SCN LEVEL 1';
Session altered.
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [],
[], [], [], []
启动仍然报错,而且ORA-600 [4194]错误也出现了,解决这个同样通过隐含参数来解决,分别在两个节点的初始化参数文件中加入下面参数,将终于将两个实例启动:
undo_management='MANUAL'
_corrupted_rollback_segments=(_SYSSMU1_1508981096$,_SYSSMU2_1971546778$,_SYSSMU3_721539051$,_SYSSMU4_3849799403$,_SYSSMU5_4000746196$,_SYSSMU6_179823687$,_SYSSMU7_3892328048$,_SYSSMU8_4105802281$,_SYSSMU9_1510219536$,_SYSSMU10_3951869473$,_SYSSMU11_1202330240$,_SYSSMU12_1617713323$,_SYSSMU13_1359816937$,_SYSSMU14_2078039711$,_SYSSMU15_1538259293$,_SYSSMU16_3126366281$,_SYSSMU17_2553547900$,_SYSSMU18_1481844821$,_SYSSMU19_135756661$,_SYSSMU20_2322289537$)
_offline_rollback_segments=(_SYSSMU1_1508981096$,_SYSSMU2_1971546778$,_SYSSMU3_721539051$,_SYSSMU4_3849799403$,_SYSSMU5_4000746196$,_SYSSMU6_179823687$,_SYSSMU7_3892328048$,_SYSSMU8_4105802281$,_SYSSMU9_1510219536$,_SYSSMU10_3951869473$,_SYSSMU11_1202330240$,_SYSSMU12_1617713323$,_SYSSMU13_1359816937$,_SYSSMU14_2078039711$,_SYSSMU15_1538259293$,_SYSSMU16_3126366281$,_SYSSMU17_2553547900$,_SYSSMU18_1481844821$,_SYSSMU19_135756661$,_SYSSMU20_2322289537$)
可以看到,在11g中,回滚段的名称已经有了很大的改变,好在刚才虽然报了ORA-600的错误,但是数据库可以短暂打开,因此可以直接登录数据库检查ROLLBACK SEGMENTS段的名称。
至此RAC两个实例都已经启动,但是解决问题的根源并没有解决,因此建议客户尽快通过逻辑备份的方式备份整个数据库。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-697160/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/4227/viewspace-697160/