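I recently helped a customer expand the capacity of an IBM P780.
After the expansion, the first node of the core database started normally, but the second node was extremely slow to start: the database took about an hour to OPEN. My first guess was that the customer had shut the database down abnormally. Alert log from the second node: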
Sun Aug 24 12:30:20 2014
SMON: enabling cache recovery
Sun Aug 24 12:30:20 2014
ARC0: Evaluating archive log 5 thread 2 sequence 9860
ARC0: Unable to archive log 5 thread 2 sequence 9860
Log actively being archived by another process
Sun Aug 24 12:30:21 2014
ARC1: Completed archiving log 5 thread 2 sequence 9860
Sun Aug 24 12:37:01 2014
Successfully onlined Undo Tablespace 6.
Sun Aug 24 12:37:01 2014
SMON: enabling tx recovery
Sun Aug 24 12:37:28 2014
Database Characterset is US7ASCII
Sun Aug 24 13:00:04 2014
replication_dependency_tracking turned off (no async multimaster replication found)
Sun Aug 24 13:07:10 2014
Waited too long for library cache load lock. More info in file /oracle/app/oracle/admin/xxxx/udump/xxxx2_ora_925808.trc.
Sun Aug 24 13:08:51 2014
Waited too long for library cache load lock. More info in file /oracle/app/oracle/admin/xxxx/udump/xxxx2_ora_966808.trc.
Sun Aug 24 13:09:25 2014
Waited too long for library cache load lock. More info in file /oracle/app/oracle/admin/xxxx/udump/xxxx2_ora_995504.trc.
Sun Aug 24 13:09:25 2014
Waited too long for library cache load lock. More info in file /oracle/app/oracle/admin/xxxx/udump/xxxx2_ora_970794.trc.
Sun Aug 24 13:09:51 2014
Completed: alter database open
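To see where the time went during the open, one option is to measure the gaps between consecutive alert-log entries. The following is a minimal Python sketch, not part of the original troubleshooting: the function names `parse_events` and `largest_gaps` are hypothetical, the sample is a subset of the lines above, and it assumes the timestamp lines are in the English format shown in this log.

```python
from datetime import datetime

# Timestamp format used by this alert log, e.g. "Sun Aug 24 12:30:20 2014".
ALERT_TS_FORMAT = "%a %b %d %H:%M:%S %Y"

# Subset of the alert-log lines quoted above.
SAMPLE_LOG = """\
Sun Aug 24 12:30:20 2014
SMON: enabling cache recovery
Sun Aug 24 12:37:01 2014
Successfully onlined Undo Tablespace 6.
Sun Aug 24 12:37:01 2014
SMON: enabling tx recovery
Sun Aug 24 12:37:28 2014
Database Characterset is US7ASCII
Sun Aug 24 13:00:04 2014
replication_dependency_tracking turned off (no async multimaster replication found)
Sun Aug 24 13:09:51 2014
Completed: alter database open
"""


def parse_events(log_text):
    """Return (timestamp, message) pairs; each message gets the timestamp line above it."""
    events = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            current = datetime.strptime(line, ALERT_TS_FORMAT)
        except ValueError:
            # Not a timestamp line, so treat it as a message.
            if current is not None:
                events.append((current, line))
    return events


def largest_gaps(events, top=3):
    """Return the `top` longest gaps as (seconds, message_before, message_after)."""
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1], later[1])
        for earlier, later in zip(events, events[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


if __name__ == "__main__":
    for seconds, before, after in largest_gaps(parse_events(SAMPLE_LOG)):
        print(f"{seconds:6.0f} s between '{before}' and '{after}'")
```

On this excerpt the longest silence is the 22 minutes 36 seconds between the character set line (12:37:28) and the replication_dependency_tracking line (13:00:04). The log itself does not say what the instance was doing in that window.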
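On Monday morning the customer reported that the Oracle interconnect (heartbeat) traffic was much higher than usual. We monitored it for a while and saw the interconnect running at 1 to 4 MB/s. Since this is a 10 Gb fiber network, that rate looked normal to me. Then, shortly after 10:00, the customer reported that node 2 had gone down.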
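As a rough sanity check on that rate, the observed throughput can be expressed as a share of the link capacity. This is only a sketch: it assumes the link is 10 Gbit/s and uses decimal units (1 MB = 1,000,000 bytes), and `link_utilization_percent` is a hypothetical helper name.

```python
def link_utilization_percent(rate_mbytes_per_s, link_gbits_per_s=10.0):
    """Convert a throughput in MB/s into a percentage of link capacity."""
    rate_bits_per_s = rate_mbytes_per_s * 1_000_000 * 8
    link_bits_per_s = link_gbits_per_s * 1_000_000_000
    return 100.0 * rate_bits_per_s / link_bits_per_s


if __name__ == "__main__":
    for rate in (1, 4):
        share = link_utilization_percent(rate)
        print(f"{rate} MB/s is {share:.2f}% of a 10 Gbit/s link")
```

Under those assumptions even 4 MB/s is well below 1% of the link, so raw bandwidth alone would not explain a problem. It says nothing about latency or packet loss on the interconnect.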
ARC0: Completed archiving log 5 thread 2 sequence 9864
Mon Aug 25 10:07:15 2014
IPC Send timeout detected. Sender ospid 2961608
Mon Aug 25 10:07:17 2014
Communications reconfiguration: instance 0
Mon Aug 25 10:07:17 2014
Trace dumping is performing id=[cdmp_20140825100717]
Mon Aug 25 10:09:02 2014
Waiting for clusterware split-brain resolution
Mon Aug 25 10:19:02 2014
USER: terminating instance due to error 481
Mon Aug 25 10:19:02 2014
Errors in file /oracle/app/oracle/admin/xxxx/bdump/xxxx2_lmon_1118210.trc:
ORA-29740: evicted by member 1, group incarnation 7
Mon Aug 25 10:19:35 2014
Trace dumping is performing id=[cdmp_20140825101905]
DIAG: terminating instance due to error 1092
Instance terminated by DIAG, pid = 1179834
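The error numbers in this excerpt can be pulled out and matched against their standard Oracle message texts. A small Python sketch follows; the message strings are quoted from memory of the Oracle error message reference and should be verified against the documentation for your release, and `extract_error_codes` is a hypothetical helper name.

```python
import re

# Message texts as recalled from the Oracle error message reference (verify for your release).
KNOWN_MESSAGES = {
    481: "LMON process terminated with error",
    1092: "ORACLE instance terminated. Disconnection forced",
    29740: "evicted by member %s, group incarnation %s",
}

# Matches both "ORA-29740" and "terminating instance due to error 481".
ERROR_PATTERN = re.compile(r"(?:ORA-|due to error )(\d+)")

# Lines copied from the alert log above.
SAMPLE_LOG = """\
USER: terminating instance due to error 481
ORA-29740: evicted by member 1, group incarnation 7
DIAG: terminating instance due to error 1092
"""


def extract_error_codes(log_text):
    """Return the error numbers in order of appearance."""
    return [int(match.group(1)) for match in ERROR_PATTERN.finditer(log_text)]


if __name__ == "__main__":
    for code in extract_error_codes(SAMPLE_LOG):
        text = KNOWN_MESSAGES.get(code, "(not in the local table)")
        print(f"ORA-{code:05d}: {text}")
```

Read together with the "IPC Send timeout detected" line, the sequence is consistent with node 2 being evicted from the cluster by instance 1 after interconnect communication failed, rather than with a crash inside the database itself.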
Checking the AIX error report (errpt):
# errpt |more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
173C787F   0825150514 I S topsvcs        Possible malfunction on local adapter
173C787F   0824122814 I S topsvcs        Possible malfunction on local adapter
173C787F   0824122714 I S topsvcs        Possible malfunction on local adapter
1BA7DF4E   0824111914 P S SRC            SOFTWARE PROGRAM ERROR
CB4A951F   0824111914 I S SRC            SOFTWARE PROGRAM ERROR
CB4A951F   0824111914 I S SRC            SOFTWARE PROGRAM ERROR
There are three "Possible malfunction on local adapter" entries.
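The errpt TIMESTAMP column uses the compact MMDDhhmmyy layout, which makes it awkward to line up against the alert log by eye. A minimal Python sketch for decoding it is shown here; `parse_errpt_timestamp` is a hypothetical helper name, and the values are the three topsvcs entries listed above.

```python
from datetime import datetime

# TIMESTAMP values of the three topsvcs entries in the errpt listing above.
TOPSVCS_STAMPS = ["0825150514", "0824122814", "0824122714"]


def parse_errpt_timestamp(stamp):
    """Decode an errpt TIMESTAMP in MMDDhhmmyy form into a datetime."""
    return datetime.strptime(stamp, "%m%d%H%M%y")


if __name__ == "__main__":
    for stamp in TOPSVCS_STAMPS:
        print(stamp, "->", parse_errpt_timestamp(stamp).strftime("%Y-%m-%d %H:%M"))
```

Decoded, the two August 24 entries fall at 12:27 and 12:28, just before the slow open begins at 12:30 in the alert log, while the August 25 entry at 15:05 comes several hours after the 10:19 eviction. That timing is suggestive of an adapter issue around the slow startup, but it does not prove one.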
# errpt -aj 173C787F
---------------------------------------------------------------------------
LABEL: TS_LOC_DOWN_ST
IDENTIFIER: 173C787F
Date/Time: Mon Aug 25 15:05:54 BEIST 2014
Sequence Number: 43985
Machine Id: 00F880244C00
Node Id: yxyxyx2
Class: S
Type: INFO
Resource Name: topsvcs
Description
Possible malfunction on local adapter
Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Recommended Actions
Verify adapter configuration
Verify network connectivity
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.39,6703
ERROR ID
6zV5DL.G/iyH/tE.0J66e.1...................
REFERENCE CODE
Adapter interface name
en13
Adapter offset
2
Adapter IP address
xx.xx.xx.xx
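The error is not conclusive. If the network itself were at fault, I would expect explicit adapter errors. en13 is an EtherChannel built from two other adapters, yet nothing here points clearly at a fault, and there is no explicit link-down error.
After some analysis, the likely suspects were the network adapter, the fiber cable, or the switch port.
So after business hours we replaced all the fiber cables and moved them to different fiber ports,
then restarted the database, which came up without any trouble.
I checked again at noon the next day and everything was normal.
A very strange problem!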
Source: "ITPUB Blog", link: http://blog.itpub.net/527318/viewspace-1262820/. If you repost this article, please credit the source; otherwise legal action may be pursued.