OSW PRVTNET log from node 2:
zzz ***Tue Mar 25 13:14:39 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 * dbrac1-priv (192.168.100.1) 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:14:51 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 * dbrac1-priv (192.168.100.1) 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Tue Mar 25 13:14:56 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 * * dbrac1-priv (192.168.100.1) 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:15:08 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms * 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Warning. Traceroute response is spanning snapshot intervals.
Analysis:
An IPC Send timeout occurred between instance 2 and instance 1, after which instance 2 was evicted from the cluster.
The traceroute output shows that private-network communication between the nodes was failing at the time:
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * * <===============================Note: each * marks a traceroute probe that got no reply; three *'s mean all three probes sent for that hop failed
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 * * *
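The failing-hop pattern above can be spotted mechanically. The following is a minimal sketch (a hypothetical helper, not part of OSWatcher) that parses traceroute output and flags any hop whose probes were all unanswered:

```python
def failed_hops(traceroute_output: str) -> list[int]:
    """Return hop numbers whose probes were all unanswered ('*')."""
    failed = []
    for line in traceroute_output.splitlines():
        fields = line.split()
        # Hop lines start with the hop number; skip header lines
        # such as "traceroute to ..." and "outgoing MTU = 1500".
        if not fields or not fields[0].isdigit():
            continue
        hop, probes = int(fields[0]), fields[1:]
        if probes and all(p == "*" for p in probes):
            failed.append(hop)
    return failed

sample = """\
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * *
2 * * *
"""
print(failed_hops(sample))
# prints: [1, 2]
```

A healthy line such as "1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms" would not be flagged, because at least one probe was answered.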
OSW netstat logs:
OSW
---------------------
dbrac1_netstat_14.03.25.1300.dat
zzz Tue Mar 25 13:14:33 BEIST 2014
udp:
439252958 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
328052 dropped due to no socket
4 broadcast/multicast datagrams dropped due to no socket
7289 socket buffer overflows
438917613 delivered
434514260 datagrams output
...
zzz Tue Mar 25 13:15:08 BEIST 2014
udp:
439253377 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
328066 dropped due to no socket
4 broadcast/multicast datagrams dropped due to no socket
7289 socket buffer overflows
438918018 delivered
434516319 datagrams output
OSW
---------------------
dbrac2_netstat_14.03.25.1300.dat
zzz Tue Mar 25 13:14:33 BEIST 2014
udp:
182516839 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
303810 dropped due to no socket
0 broadcast/multicast datagrams dropped due to no socket
0 socket buffer overflows
182213029 delivered
188182343 datagrams output
...
zzz Tue Mar 25 13:15:02 BEIST 2014
udp:
182517239 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
303819 dropped due to no socket
0 broadcast/multicast datagrams dropped due to no socket
0 socket buffer overflows
182213420 delivered
188182641 datagrams output
Analysis: the UDP statistics show no socket buffer overflows, bad checksums, incomplete headers, etc.
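That conclusion comes from diffing the UDP error counters between the two snapshots. A small sketch of the check (a hypothetical helper, not an OSWatcher tool), fed with the node-1 figures from the 13:14:33 and 13:15:08 excerpts above:

```python
UDP_COUNTERS = [
    "incomplete headers",
    "bad data length fields",
    "bad checksums",
    "dropped due to no socket",
    "socket buffer overflows",
]

def parse_udp(snapshot: str) -> dict[str, int]:
    """Map counter name -> value; netstat lines look like '<count> <name>'."""
    stats = {}
    for line in snapshot.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1] in UDP_COUNTERS:
            stats[parts[1]] = int(parts[0])
    return stats

def error_deltas(before: str, after: str) -> dict[str, int]:
    """Counter growth between two snapshots of the same node."""
    b, a = parse_udp(before), parse_udp(after)
    return {name: a[name] - b[name] for name in a if name in b}

before = """\
0 bad checksums
328052 dropped due to no socket
7289 socket buffer overflows
"""
after = """\
0 bad checksums
328066 dropped due to no socket
7289 socket buffer overflows
"""
print(error_deltas(before, after))
# prints: {'bad checksums': 0, 'dropped due to no socket': 14, 'socket buffer overflows': 0}
```

Zero growth in "socket buffer overflows" and "bad checksums" during the problem window is what rules out a UDP buffer or corruption issue here.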
OSW vmstat logs:
OSW
-------------------------
dbrac1_vmstat_14.03.25.1300.dat
zzz ***Tue Mar 25 13:14:39 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 6927214 7214 0 0 0 0 0 0 60 3262132 9492 11 9 78 1
0 0 6926387 9432 0 0 0 1414 1413 0 72 3213613 9907 10 7 82 1
2 0 6926147 9652 0 0 0 0 0 0 56 3236071 7970 6 5 89 1
zzz ***Tue Mar 25 13:14:45 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 6922909 12885 0 0 0 0 0 0 94 3252893 10659 11 9 79 0
2 0 6922917 12856 0 0 0 0 0 0 50 3211865 8039 7 5 88 0
1 0 6922685 13069 0 0 0 0 0 0 49 3247836 7710 6 5 88 0
zzz ***Tue Mar 25 13:14:51 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 6922855 12894 0 0 0 0 0 0 126 3243290 9261 11 9 79 1
1 0 6922863 12865 0 0 0 0 0 0 34 3206410 7722 7 5 88 1
2 0 6924085 11623 0 0 0 0 0 0 45 3229507 7859 7 5 88 1
zzz ***Tue Mar 25 13:14:56 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 0 6922895 12779 0 0 0 0 0 0 79 3252256 8939 11 9 80 1
1 0 6922902 12750 0 0 0 0 0 0 44 3195964 7207 7 5 87 1
2 0 6922671 12929 0 0 0 0 0 0 80 3208503 7401 6 5 89 1
OSW
-------------------------
dbrac2_vmstat_14.03.25.1300.dat
zzz ***Tue Mar 25 13:14:39 BEIST 2014
System configuration: lcpu=24 mem=49151MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 4369629 7628806 0 0 0 0 0 0 57 3252089 7493 10 8 81 0
1 0 4369637 7628777 0 0 0 0 0 0 22 3201077 6327 6 4 89 1
1 0 4369406 7628989 0 0 0 0 0 0 32 3210307 7840 6 5 89 0
zzz ***Tue Mar 25 13:14:45 BEIST 2014
System configuration: lcpu=24 mem=49151MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 4369486 7628912 0 0 0 0 0 0 118 3252929 7799 11 9 79 1
1 1 4369410 7628967 0 0 0 0 0 0 304 3210820 9094 6 5 88 1
1 0 4369179 7629179 0 0 0 0 0 0 38 3215090 5315 6 4 89 1
zzz ***Tue Mar 25 13:14:51 BEIST 2014
System configuration: lcpu=24 mem=49151MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 4369636 7628720 0 0 0 0 0 0 55 3255957 9446 11 9 79 1
2 0 4369644 7628691 0 0 0 0 0 0 22 3219176 6673 6 4 89 1
1 0 4369413 7628903 0 0 0 0 0 0 27 3205112 5832 6 4 89 1
Analysis: the run queue (r) is low, the blocked-thread count (b) is low, CPU utilization is low, and there is no paging (pi/po).
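The same check, as a rough sketch: scan AIX vmstat data lines for a long run queue, blocked threads, paging, or CPU saturation. The thresholds here are illustrative assumptions, not Oracle-documented limits:

```python
def vmstat_flags(samples: list[str]) -> list[str]:
    """Check AIX vmstat data lines
    (columns: r b avm fre re pi po fr sr cy in sy cs us sy id wa)."""
    warnings = []
    for line in samples:
        f = line.split()
        r, b = int(f[0]), int(f[1])
        pi, po = int(f[5]), int(f[6])
        idle = int(f[15])
        if r > 12:          # run queue far longer than half the 24 lcpus
            warnings.append(f"run queue high: r={r}")
        if b > 2:           # sustained blocked threads
            warnings.append(f"blocked threads: b={b}")
        if pi or po:        # any paging in or out
            warnings.append(f"paging: pi={pi} po={po}")
        if idle < 10:       # CPU nearly saturated
            warnings.append(f"cpu saturated: id={idle}")
    return warnings

# One of the node-1 samples above: clean on every check.
print(vmstat_flags(["3 0 6927214 7214 0 0 0 0 0 0 60 3262132 9492 11 9 78 1"]))
# prints: []
```

Every sample in both nodes' logs above passes these checks, which is why OS resource pressure can be excluded as the cause.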
dbracdb1_lmon_6423022.trc:
2014-03-25 13:14:59.599871 : kjxgrcomerr: Communications reconfig: instance 2 (50,50)
*** 2014-03-25 13:14:59.687
2014-03-25 13:14:59.687302 : kjxgrrcfg: done - ret = 3 hist 0x1679a (initial rsn: 3)
kjxgrrcfgchk: Initiating reconfig, reason=3 <======================================reason 3: communication failure
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.<============================Recievers are healthy
2014-03-25 13:14:59.687425 : kjxgrnetchk: start 0x19b28646, end 0x19b33c90
2014-03-25 13:14:59.687444 : kjxgrnetchk: Network Validation wait: 46 sec
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x7e8aac60
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
*** 2014-03-25 13:14:59.687
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 50 0.
kjxgrs0h: disable CGS timeout
...
2014-03-25 13:15:42.511713 : kjxgrDiskVote: bitmap: 1
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 54 2.
kjfmSendAbortInstMsg: send an abort message to instance 2
dbracdb2_lmon_7995464.trc:
*** 2014-03-25 13:14:59.868
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2014-03-25 13:14:59.867489 : kjxgrnetchk: start 0xafd32909, end 0xafd3df53
2014-03-25 13:14:59.867509 : kjxgrnetchk: Network Validation wait: 46 sec
2014-03-25 13:14:59.867528 : kjxgrnetchk: Sending comm check req to inst 1
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
*** 2014-03-25 13:14:59.868
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 50 0.
kjxgrs0h: disable CGS timeout
...
2014-03-25 13:15:25.722114 : kjxgrdtrt: Evicted by inst 1, seq (52, 52)
IMR state information
Inst 2, thread 2, state 0x4:124c, flags 0x12ca9:0x0001
Analysis:
According to the LMON traces, instance 2 was evicted by instance 1 with reason 3: Communication Failure.
Reason 3 is documented as follows:
Troubleshooting ORA-29740 in a RAC Environment (Doc ID 219361.1)
Reason 3: Communications Failure. This can happen if:
a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of
another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
and is not responding to remote requests.
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
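The reason codes above lend themselves to a small lookup table (values taken from the Doc ID 219361.1 excerpt quoted above), so a trace line such as "Initiating reconfig, reason=3" can be decoded at a glance. `decode_reason` is a hypothetical helper, not an Oracle utility:

```python
import re

ORA_29740_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications Failure",
    4: "Reconfiguration after suspend",
}

def decode_reason(trace_line: str) -> str:
    """Pull 'reason=N' out of an LMON trace line and translate it."""
    m = re.search(r"reason=(\d+)", trace_line)
    return ORA_29740_REASONS.get(int(m.group(1)), "unknown") if m else "unknown"

print(decode_reason("kjxgrrcfgchk: Initiating reconfig, reason=3"))
# prints: Communications Failure
```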
Continuing with the trace files:
dbracdb1_lms0_6160700.trc:
2014-03-25 13:13:03.149035 : GSIPC:PING: rcv'd PINGREQ[4] from 2.1 (seq 0.10358654) stm 0x45af2ac2
2014-03-25 13:13:03.149109 : GSIPC:PING: send PINGACK[4] to 2.1 (seq 0.10360096)
*** 2014-03-25 13:14:54.428
IPC Send timeout detected. Sender: ospid 6160700 [Oracle@dbrac1 (LMS0)] <============LMS0 of instance1 is the sender of the IPC
Receiver: inst 2 binc 291585594 ospid 11010320 <========= LMS0 of instance2 is the receiver of the IPC
dumping request handle 11123d958
krqh = 11123d958
...
Trace Bucket Dump End: KSI
Reporting Communication error with instance 2
dbracdb2_lms0_11010320.trc:
*** 2014-03-25 13:15:25.242
===== Idle Connection Overview =====
Idle connections [0x198ea]: 1
IdleConn List: 1[r:0.10358654,t:0x198ea]
GSIPC:IKILL: ping to inst 1 start 104682 now 104824 icktm 140 psm 0
Reporting Communication error with instance 1
Analysis: both the sender and the receiver of the IPC reported communication errors with each other.
Summary of the analysis:
We can now confirm that instance 2 was evicted from the cluster by instance 1 because inter-instance communication timed out.
The sender of the timed-out IPC was LMS0 of instance 1 and the receiver was LMS0 of instance 2. Both processes were alive and well at the time, and LMON's own check reported "Recievers are healthy".
LMON and the LMS0 processes of both instances all reported Communication Failure/Error, which shows that the problem was caused by a communication failure.
The OSW data shows that neither node had a resource problem at the time, but the traceroute output contains "* * *", which points to a private-network communication problem.
Solution:
The problem was resolved by replacing the private-network fiber cable; follow-up verification confirmed that the database returned to normal.