环境:
AIX 5.3
Oracle 10.2.0.2 RAC
2 instance
错误:看alert把
Thu Jul 10 15:37:47 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm1_lgwr_446692.trc:
ORA-00340: IO error processing online log 2 of thread 1
ORA-00345: redo log write error block 781893 count 4
ORA-00312: online log 2 thread 1: '/dev/rlvredo12'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
。。。。。。
Fri Jul 11 13:22:22 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm1_j000_1073290.trc:
ORA-12012: error on auto execute of job 1
ORA-00376: file 3 cannot be read at this time
ORA-01110: data file 3: '/dev/rlvsysaux01'
ORA-06512: at "SYSMAN.EMD_MAINTENANCE", line 48
ORA-06512: at "SYSMAN.EMD_MAINTENANCE", line 936
ORA-06512: at line 1
Fri Jul 11 13:22:37 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm1_j000_1073290.trc:
ORA-12012: error on auto execute of job 42579
ORA-00376: file ORA-00376: file 3 cannot be read at this time
ORA-01110: data file 3: '/dev/rlvsysaux01'
ORA-06512: at "EXFSYS.DBMS_RLMGR_DR", line 15
ORA-06512: at line 1
cannot be read at this time
肯定要先看lv的状态:
[qasmdm1:oracle] /oracle/app/oracle/admin/qasmdm/bdump> lsvg -l qasmdmvg
qasmdmvg:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
lv_ocr jfs 2 2 1 open/syncd N/A
lv_voting jfs 2 2 1 open/syncd N/A
lvuser03 jfs 128 128 1 open/syncd N/A
lv_temp jfs 160 160 1 open/syncd N/A
lvcntrl1 jfs 1 1 1 open/syncd N/A
lvcntrl2 jfs 1 1 1 open/syncd N/A
lvcntrl3 jfs 1 1 1 open/syncd N/A
lvsystem01 jfs 16 16 1 open/syncd N/A
lvuser01 jfs 32 32 1 open/syncd N/A
lvundotbs01 jfs 160 160 1 open/syncd N/A
lvundotbs02 jfs 160 160 1 open/syncd N/A
lvredo11 jfs 8 8 1 open/syncd N/A
lvredo12 jfs 8 8 1 open/syncd N/A
lvredo13 jfs 8 8 1 closed/syncd N/A
lvredo21 jfs 8 8 1 closed/syncd N/A
lvredo22 jfs 8 8 1 closed/syncd N/A
lvredo33 jfs 8 8 1 closed/syncd N/A
lvsysaux01 jfs 32 32 1 closed/syncd N/A
lvuser02 jfs 256 256 2 open/syncd N/A
lvuser04 jfs 128 128 2 open/syncd N/A
lvsystem02 jfs 32 32 1 open/syncd N/A
搞过aix的应该知道:closed并不代表逻辑卷有问题。并且看了一下最近的hacmp.out也没发现问题。
再看操作系统日志:
AFA89905 0711130408 I O grpsvcs Group Services daemon started
97419D60 0711130408 I O topsvcs Topology Services daemon started
AE3E3FAD 0711125608 I O SYSJ2 FSCK FOUND ERRORS
AE3E3FAD 0711125508 I O SYSJ2 FSCK FOUND ERRORS
AE3E3FAD 0711125408 I O SYSJ2 FSCK FOUND ERRORS
A6DF45AA 0711124508 I O RMCdaemon The daemon is started.
7BFEEA1F 0711124308 T H fcs0 LINK ERROR
7BFEEA1F 0711124308 T H fcs0 LINK ERROR
7BFEEA1F 0711124308 T H fcs1 LINK ERROR
7BFEEA1F 0711124308 T H fcs1 LINK ERROR
2BFA76F6 0711124308 T S SYSPROC SYSTEM SHUTDOWN BY USER
9DBCFDEE 0711124408 T O errdemon ERROR LOGGING TURNED ON
192AC071 0710203708 T O errdemon ERROR LOGGING TURNED OFF
AA8AB241 0710203708 T O OPERATOR OPERATOR NOTIFICATION
A6DF45AA 0710193808 I O RMCdaemon The daemon is started.
F45CFAFB 0710193808 P H hdisk29 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk25 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk15 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk5 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk28 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk24 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk14 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk4 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk39 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk33 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk19 OPEN FAILURE
F45CFAFB 0710193808 P H hdisk9 OPEN FAILURE
据说:昨天存储出过问题,但是今天中午已经修好了,并且其他表空间跟sysaux在同一个阵列上的,并且同一个vg.
理论上应该排除硬件问题了。
于是查一下数据文件的情况:
SQL> select file#,status from v$datafile;
FILE# STATUS
---------- -------
1 SYSTEM
2 ONLINE
3 RECOVER
4 ONLINE
5 ONLINE
6 SYSTEM
7 ONLINE
8 ONLINE
9 ONLINE
9 rows selected.
data file 3: '/dev/rlvsysaux01' 状态是recover
SQL> select file_id,bytes,status,online_status from dba_data_files;
FILE_ID BYTES STATUS ONLINE_
---------- ---------- --------- -------
4 8589148160 AVAILABLE ONLINE
3 AVAILABLE RECOVER
2 3.0774E+10 AVAILABLE ONLINE
1 2144337920 AVAILABLE SYSTEM
5 3.1029E+10 AVAILABLE ONLINE
6 4293918720 AVAILABLE SYSTEM
7 3.4357E+10 AVAILABLE ONLINE
8 3.4357E+10 AVAILABLE ONLINE
9 3.4357E+10 AVAILABLE ONLINE
9 rows selected.
datafile状态:available,bytes居然是null
看到recover自然要查v$recover_file,再往下查很明朗了:
SQL> select * from V$RECOVER_FILE;
FILE# ONLINE ONLINE_ ERROR CHANGE# TIME
---------- ------- ------- ----------- ---------- ---------
3 OFFLINE OFFLINE 1.6800E+10 10-JUL-08
SQL> alter database recover datafile 3;
alter database recover datafile 3
*
ERROR at line 1:
ORA-00279: change 16799532025 generated at 07/10/2008 14:56:19 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4049_653004728.dbf
ORA-00280: change 16799532025 for thread 1 is in sequence #4049
到这里,有个插曲:
我一看时间以为要麻烦了,因为是昨天的归档,可能已经被备份并删除了。准备要找tsm的team的时候想 :还是先看一眼吧:
[qasmdm1:oracle] /qasmdmarch1> ls
1_4042_653004728.dbf 1_4044_653004728.dbf 1_4046_653004728.dbf 1_4048_653004728.dbf 1_4050_653004728.dbf
1_4043_653004728.dbf 1_4045_653004728.dbf 1_4047_653004728.dbf 1_4049_653004728.dbf 1_4051_653004728.dbf
归档文件居然还在,后来用recover datafile 3就不这样了
并且在这个时候做了一件事情,那就是把node 2的归档也ftp到了这个目录来。
接:一个RAC恢复案例并思路--1
SQL> alter database recover tablespace sysaux;
alter database recover tablespace sysaux
*
ERROR at line 1:
ORA-00275: media recovery has already been started
SQL> alter database recover cancel;
Database altered.
SQL> recover datafile 3;
ORA-00279: change 16799532025 generated at 07/10/2008 14:56:27 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3886_653004728.dbf
ORA-00280: change 16799532025 for thread 2 is in sequence #3886
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00279: change 16799532025 generated at 07/10/2008 14:56:19 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4049_653004728.dbf
ORA-00280: change 16799532025 for thread 1 is in sequence #4049
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00279: change 16799532174 generated at 07/10/2008 14:56:40 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4050_653004728.dbf
ORA-00280: change 16799532174 for thread 1 is in sequence #4050
ORA-00278: log file '/qasmdmarch1/1_4049_653004728.dbf' no longer needed for this recovery
......
......省略了些
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00279: change 16800179378 generated at 07/11/2008 13:06:58 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3887_653004728.dbf
ORA-00280: change 16800179378 for thread 2 is in sequence #3887
ORA-00278: log file '/qasmdmarch1/2_3886_653004728.dbf' no longer needed for this recovery
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00308: cannot open archived log '/qasmdmarch1/2_3887_653004728.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
SQL> recover datafile 3;
ORA-00279: change 16800179378 generated at 07/11/2008 13:06:58 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3887_653004728.dbf
ORA-00280: change 16800179378 for thread 2 is in sequence #3887
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00308: cannot open archived log '/qasmdmarch1/2_3887_653004728.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
到这里也应该能猜到,recover的过程中node2正好又生成了新的归档,第一次ftp没有复制过来。
把刚生成的归档再ftp过来
PEKAX131:/qasmdmarch1>ls
1_4042_653004728.dbf 1_4046_653004728.dbf 1_4050_653004728.dbf 2_3881_653004728.dbf 2_3885_653004728.dbf
1_4043_653004728.dbf 1_4047_653004728.dbf 1_4051_653004728.dbf 2_3882_653004728.dbf 2_3886_653004728.dbf
1_4044_653004728.dbf 1_4048_653004728.dbf 1_4052_653004728.dbf 2_3883_653004728.dbf 2_3887_653004728.dbf
1_4045_653004728.dbf 1_4049_653004728.dbf 1_4053_653004728.dbf 2_3884_653004728.dbf
然后重新开始recover:
SQL> recover datafile 3;
ORA-00279: change 16800179378 generated at 07/11/2008 13:06:58 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3887_653004728.dbf
ORA-00280: change 16800179378 for thread 2 is in sequence #3887
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00279: change 16800179378 generated at 07/10/2008 15:03:46 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4051_653004728.dbf
ORA-00280: change 16800179378 for thread 1 is in sequence #4051
Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00279: change 16800179379 generated at 07/11/2008 13:06:59 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4052_653004728.dbf
ORA-00280: change 16800179379 for thread 1 is in sequence #4052
ORA-00278: log file '/qasmdmarch1/1_4051_653004728.dbf' no longer needed for this recovery
Specify log: {=suggested | filename | AUTO | CANCEL}
Log applied.
Media recovery complete.
Perfect!
recover已经complete,然后该online 表空间了
SQL> alter tablespace sysaux online;
Tablespace altered.
SQL> select * from V$RECOVER_FILE;
no rows selected
SQL> select file_id,bytes,status,online_status from dba_data_files;
FILE_ID BYTES STATUS ONLINE_
---------- ---------- --------- -------
4 8589148160 AVAILABLE ONLINE
3 4293918720 AVAILABLE ONLINE
2 3.0774E+10 AVAILABLE ONLINE
1 2144337920 AVAILABLE SYSTEM
5 3.1029E+10 AVAILABLE ONLINE
6 4293918720 AVAILABLE SYSTEM
7 3.4357E+10 AVAILABLE ONLINE
8 3.4357E+10 AVAILABLE ONLINE
9 3.4357E+10 AVAILABLE ONLINE
已经没有问题了,file也已经online.
然后再第二个节点上:
SQL> alter tablespace sysaux online;
Tablespace altered.
原因分析和整个过程总结:
10日存储有问题,导致/dev/rlvsysaux01 offline.
因为这个表空间比较特殊,用户应用普通的查询和修改等不会被影响,所以问题没有被发现
当有job等需要碰到它操作的时候才会触发错误.
如ORA-12012: error on auto execute of job 1
结决过程思路比较简单:
因为硬件问题已经解决,所以恢复也比较简单就是recover datafile
由于rac的两个instance 各自归档在本地,所以在其中一个节点recover的时候要保证它能读到两个instance的归档,所以用了ftp把归档都放在了执行recover的节点
redo log 本来就是共享的,所以不用关心,自然能recover.
最后找到了错误的源头:
hu Jul 10 15:38:02 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm2_lgwr_573572.trc:
ORA-00345: redo log write error block 586061 count 2
ORA-00312: online log 4 thread 2: '/dev/rlvredo22'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 46: Device not ready
Additional information: -1
Additional information: 1024