昨天在查看数据库备份情况时,查看日志发现:
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_24/thread_1_seq_595.499.883235121 thread=1 sequence=595
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_25/thread_1_seq_596.498.883270839 thread=1 sequence=596
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_25/thread_1_seq_597.497.883270839 thread=1 sequence=597
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_26/thread_1_seq_598.496.883357239 thread=1 sequence=598
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_26/thread_1_seq_599.495.883357239 thread=1 sequence=599
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_27/thread_1_seq_600.494.883443639 thread=1 sequence=600
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_27/thread_1_seq_601.493.883443639 thread=1 sequence=601
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_28/thread_1_seq_602.492.883530039 thread=1 sequence=602
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_28/thread_1_seq_603.491.883530039 thread=1 sequence=603
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_29/thread_1_seq_604.490.883616439 thread=1 sequence=604
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_29/thread_1_seq_605.489.883616439 thread=1 sequence=605
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_30/thread_1_seq_606.488.883702839 thread=1 sequence=606
数据库版本为11.2.0.4,linux环境,用的是DATA GUARD,通过日志可以看出,归档日志未传递到备库,日志不能删除,而主库一直正常运行,说明备库出了问题,于是用sqlplus / as sysdba登录备库,提示连接到一个空实例,说明备库宕了。
查看alert日志:
Wed Jun 24 01:00:01 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Wed Jun 24 15:05:41 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Wed Jun 24 15:05:42 2015
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_asmb_114106.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1261 Serial number: 11
Errors in file /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_asmb_114106.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1261 Serial number: 11
ASMB (ospid: 114106): terminating the instance due to error 15064
Wed Jun 24 15:05:42 2015
System state dump requested by (instance=1, osid=114106 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_diag_114080_20150624150542.trc
Dumping diagnostic data in directory=[cdmp_20150624150542], requested by (instance=1, osid=114106 (ASMB)), summary=[abnormal instance termination].
Instance terminated by ASMB, pid = 114106
原来6月24号就已经挂了,因为备份在主库上做,策略是删除前7天的归档日志,也未看备库信息,所以一直没发现问题,看来以后要随时观察主备库情况,有问题及时解决
那么就先把备库起来再说
SQL> startup;
ORACLE instance started.
Total System Global Area 732352512 bytes
Fixed Size 1367144 bytes
Variable Size 331350936 bytes
Database Buffers 377487360 bytes
Redo Buffers 22147072 bytes
Database mounted.
Database opened.
启日志应用:
SQL> altter database recover managed standby database disconnect from session;
接下来分析日志吧,先看当时的VKTM trace file:
*** 2015-06-22 20:00:01.102
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1434974497506826 drift: 1071
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Time stalled at 1434974401
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Stall, backward drift ended at 1434974402 drift: 1
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435024898106410
*** 2015-06-23 10:00:01.197
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435024898107666 drift: 1256
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435028498622049
*** 2015-06-23 11:00:01.108
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435028498623142 drift: 1093
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Time stalled at 1435028401
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Stall, backward drift ended at 1435028402 drift: 1
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435078899223760
*** 2015-06-24 01:00:01.205
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435078899224947 drift: 1187
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435082499729169
*** 2015-06-24 02:00:01.106
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435082499730249 drift: 1080
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time jumped forward by (339800056)usec at (1435129640131631) whereas (1000000) is allowed
一直有关于时间不同步的警告,查看service ntpd tatus,发现ntp已经停了,于是先启动ntp,service ntpd start
根据alert中的信息,把systemstate dump信息转换下,用aas.awk工具:
awk -f ass1038.awk /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_diag_114080_20150624150542.trc
Starting Systemstate 1
.................................................................
Ass.Awk Version 1.0.38
~~~~~~~~~~~~~~~~~~~~~~
Source file : /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_diag_114080_20150624150542.trc
System State 1 (2015-06-24 15:05:42.666)
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
1: [DEAD]
2: waiting for 'pmon timer' [DEAD]
3: waiting for 'rdbms ipc message' [DEAD]
4: waiting for 'VKTM Logical Idle Wait' [DEAD]
5: waiting for 'rdbms ipc message' [DEAD]
6: waiting for 'DIAG idle wait'
7: waiting for 'rdbms ipc message' [DEAD]
8: waiting for 'DIAG idle wait' [DEAD]
9: waiting for 'rdbms ipc message' [DEAD]
10: waiting for 'rdbms ipc message'
11: waiting for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'rdbms ipc message'
14: waiting for 'rdbms ipc message' [DEAD]
15: waiting for 'rdbms ipc message' [DEAD]
16: waiting for 'smon timer' [DEAD]
17: waiting for 'rdbms ipc message' [DEAD]
18: waiting for 'rdbms ipc message' [DEAD]
19:
20: waiting for 'rdbms ipc message' [DEAD]
21: waiting for 'rdbms ipc message' [DEAD]
22: waiting for 'wait for unread message on broadcast channel' [DEAD]
23: waiting for 'SQL*Net message from client' [DEAD]
24: waiting for 'SQL*Net message from client' [DEAD]
26: waiting for 'rdbms ipc message' [DEAD]
27: waiting for 'rdbms ipc message' [DEAD]
28: waiting for 'rdbms ipc message' [DEAD]
29: waiting for 'rdbms ipc message' [DEAD]
30: waiting for 'rdbms ipc message' [DEAD]
32: waiting for 'SQL*Net message from client' [DEAD]
34: waiting for 'SQL*Net message from client' [DEAD]
36: waiting for 'class slave wait' [DEAD]
37: waiting for 'MRP wait on archivelog arrival' [DEAD]
38: waiting for 'parallel recovery slave next change' [DEAD]
39: waiting for 'parallel recovery slave next change' [DEAD]
40: waiting for 'parallel recovery slave next change' [DEAD]
41: waiting for 'parallel recovery slave next change' [DEAD]
42: waiting for 'parallel recovery slave next change' [DEAD]
43: waiting for 'parallel recovery slave next change' [DEAD]
44: waiting for 'parallel recovery slave next change' [DEAD]
45: waiting for 'parallel recovery slave next change' [DEAD]
46: waiting for 'parallel recovery slave next change' [DEAD]
47: waiting for 'parallel recovery slave next change' [DEAD]
48: waiting for 'parallel recovery slave next change' [DEAD]
49: waiting for 'parallel recovery slave next change' [DEAD]
50: waiting for 'parallel recovery slave next change' [DEAD]
51: waiting for 'parallel recovery slave next change' [DEAD]
52: waiting for 'parallel recovery slave next change' [DEAD]
53: waiting for 'parallel recovery slave next change' [DEAD]
54: waiting for 'parallel recovery slave next change' [DEAD]
55: waiting for 'parallel recovery slave next change' [DEAD]
56: waiting for 'parallel recovery slave next change' [DEAD]
57: waiting for 'parallel recovery slave next change' [DEAD]
58: waiting for 'parallel recovery slave next change' [DEAD]
59: waiting for 'parallel recovery slave next change' [DEAD]
60: waiting for 'parallel recovery slave next change' [DEAD]
61: waiting for 'parallel recovery slave next change' [DEAD]
62: waiting for 'parallel recovery slave next change' [DEAD]
63: waiting for 'parallel recovery slave next change' [DEAD]
64: waiting for 'parallel recovery slave next change' [DEAD]
65: waiting for 'parallel recovery slave next change' [DEAD]
66: waiting for 'parallel recovery slave next change' [DEAD]
67: waiting for 'parallel recovery slave next change' [DEAD]
68: waiting for 'parallel recovery slave next change' [DEAD]
69: waiting for 'parallel recovery slave next change' [DEAD]
NO BLOCKING PROCESSES FOUND
------------------ooOoo------------------
For the LATEST version of this utility see
http://dlsunuk11.uk.oracle.com/Public/Utils.html#ass
For additional documentation see
http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html
Suggested improvements, bugs etc. should be sent to kevin.p.quinn@oracle.com
End of report. 49578 Lines Processed.
由于ASMB进程死掉,导致oralce实例宕,于是不能接收从主库传递过来的日志信息,起因就是ntp出了问题
具体可见MOS ORA-15064 as per ASM instance founds sessions unresponsive for over 3 mins (文档 ID 1615459.1)
根据MOS文档 Base Bug 11837095介绍,此bug已在11.2.0.3中fixed.
根据MOS文档 ID 1347586.1介绍,可以设置事件来解决此问题:
alter system set event="10795 trace name context forever, level 2" scope=spfile;
然后重启实例。后期要注意观察。另可见大神惜纷飞的博客也有说明:http://www.oratping.com/?p=359