数据库版本
11.2.0.4.0
操作系统版本
CentOS release 6.3
事件描述:
1)数据库做完备份恢复后,使用alter database open resetlogs打开数据库。
2)添加OGG抽取进程
delete extract E_E2P_01
add extract E_E2P_01 ,tranlog,begin now
add exttrail /home/erpdev7/ggs/dirdat/e1 ,extract E_E2P_01
start E_E2P_01
3)启动OGG进程时ggserr.log会刷出如下日志:
2021-09-11 18:10:17 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:18 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:20 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:22 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:24 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:26 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:27 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:29 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:31 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:33 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:34 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
2021-09-11 18:10:36 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, e_e2p_01.prm: Positioning to begin time Sep 11, 2021 6:09:48 PM.
给人的感觉就是OGG初始化的时候,找不到这个点对应的redo的位置。然后过一段时间or我们手工执行alter system switch logfile的时候,数据库就挂掉了。
数据库alert日志输出
Sat Sep 11 18:15:48 2021
Errors in file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_lgwr_3191.trc:
ORA-00316: log 116 of thread 1, type 0 in header is not log file
ORA-00312: online log 116 thread 1: '/u01/erpdev7data/group_116.2711.1053734237'
Errors in file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_lgwr_3191.trc:
ORA-00316: log 116 of thread 1, type 0 in header is not log file
ORA-00312: online log 116 thread 1: '/u01/erpdev7data/group_116.2711.1053734237'
LGWR (ospid: 3191): terminating the instance due to error 316
Sat Sep 11 18:15:49 2021
System state dump requested by (instance=1, osid=3191 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_diag_3170_20210911181549.trc
Dumping diagnostic data in directory=[cdmp_20210911181549], requested by (instance=1, osid=3191 (LGWR)), summary=[abnormal instance termination].
Instance terminated by LGWR, pid = 3191
很不幸的是 报错的这个logfile是current logfile。。当我们启动数据库的时候,日志输出:
Sat Sep 11 18:19:32 2021
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
parallel recovery started with 32 processes
Started redo scan
Errors in file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_ora_37350.trc:
ORA-00316: log 116 of thread 1, type 0 in header is not log file
ORA-00312: online log 116 thread 1: '/u01/erpdev7data/group_116.2711.1053734237'
Aborting crash recovery due to error 316
Errors in file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_ora_37350.trc:
ORA-00316: log 116 of thread 1, type 0 in header is not log file
ORA-00312: online log 116 thread 1: '/u01/erpdev7data/group_116.2711.1053734237'
Errors in file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_ora_37350.trc:
ORA-00316: log 116 of thread 1, type 0 in header is not log file
ORA-00312: online log 116 thread 1: '/u01/erpdev7data/group_116.2711.1053734237'
ORA-316 signalled during: ALTER DATABASE OPEN...
Sat Sep 11 18:19:33 2021
Errors in file /home/erpdev7/product/11.2.0/admin/DEV7_bjkjy-erp-dev7db01/diag/rdbms/dev7/DEV7/trace/DEV7_m000_37424.trc:
ORA-00316: log 116 of thread 1, type 0 in header is not log file
ORA-00312: online log 116 thread 1: '/u01/erpdev7data/group_116.2711.1053734237'
Checker run found 2 new persistent data failures
4)尝试使用隐含参数_allow_resetlogs_corruption和_allow_error_simulation,重建控制文件,recover database,然后open resetlogs,不行 直接00600,推scn也不行,最后只能重新还原数据库。
问题分析:
我们有很多线下环境,也做过很多次恢复,这是第一次出现这个问题,横向对比了一下环境的变化,就是这次的存放数据文件的文件系统变成了xfs,之前一直是ext4(修改的原因是由于库的数据文件大小超过16T,ext4无法继续扩展)。
尝试在备份还原后重建控制文件的时候,修改redo的存放路径,将其放到ext4的文件系统里,就一切正常了,配置OGG的时候也没有再次触发数据库异常。
看了网上也有很多oracle使用xfs的,开始怀疑不是xfs的问题了,且存放数据文件的盘是存储划上来的,也存在存储I/O异常导致redo损坏的可能,这方面也是要考虑的。