###### sndb3 ######
Fri Nov 25 09:26:08 2011 IPC Send timeout detected. Receiver ospid 340396 -- sndb3
..........
Fri Nov 25 09:28:54 2011 IPC Send timeout detected. Receiver ospid 340396 -- sndb3
Fri Nov 25 10:04:19 2011 Trace dumping is performing id=[cdmp_20111125100426]
Fri Nov 25 10:10:07 2011 Warning: cluster_database_instances (2) of node 0 is <= my node id (2)
Fri Nov 25 10:12:20 2011 Reconfiguration started -- 开始 reconfig 此时 List of nodes: 0 2
Fri Nov 25 10:28:05 2011 Reconfiguration started -- 第2次 reconfig List of nodes: 0 1 2
Fri Nov 25 10:49:03 2011 Reconfiguration started -- 第3次 reconfig List of nodes: 2
Fri Nov 25 10:49:50 2011 Reconfiguration started -- 第4次 reconfig List of nodes: 1 2
Fri Nov 25 10:55:17 2011 Reconfiguration started -- 第5次 reconfig List of nodes: 2
Fri Nov 25 10:55:18 2011 Reconfiguration complete
Fri Nov 25 10:55:19 2011 Shutting down instance (abort) -- abort 数据库
Fri Nov 25 10:56:47 2011 Starting ORACLE instance -- 正常启动数据库
###### sndb1 ######
Fri Nov 25 09:26:29 2011 IPC Send timeout detected.Sender: ospid 352952
Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:26:32 2011 ORA-27506: IPC error connecting to a port
Fri Nov 25 09:26:33 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 09:26:38 2011 IPC Send timeout detected.Sender: ospid 258258
Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:27:05 2011 ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 09:42:09 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03135: connection lost contact
ORA-03135: connection lost contact
Fri Nov 25 10:12:41 2011 Reconfiguration started -- 第1次 reconfig List of nodes: 0 2
Fri Nov 25 10:22:55 2011 IPC Send timeout detected.Sender: ospid 127746
Receiver: inst 3 binc -981855203 ospid 340396
.............
.............
Fri Nov 25 10:22:58 2011 IPC Send timeout detected.Sender: ospid 299594
Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 10:22:58 2011 ORA-27506: IPC error connecting to a port
Fri Nov 25 10:22:59 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
.............
Fri Nov 25 10:23:25 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:28:26 2011 Reconfiguration started -- 第2次 reconfig List of nodes: 0 1 2
Fri Nov 25 10:33:55 2011 Reconfiguration complete
Fri Nov 25 10:39:14 2011 IPC Send timeout detected.Sender: ospid 377580
Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 10:39:18 2011 ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:49:19 2011 ORA-29746: Cluster Synchronization Service is being shut down.
Fri Nov 25 10:49:20 2011 ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
ORA-29746: Cluster Synchronization Service is being shut down.
Instance terminated by LMON, pid = 365348
Fri Nov 25 11:02:12 2011 Starting ORACLE instance (normal) -- 正常启动数据库实例 sndb1
###### sndb2 ######
Fri Nov 25 09:26:20 2011 Trace dumping is performing id=[cdmp_20111125092634]
Fri Nov 25 09:26:24 2011 IPC Send timeout detected.Sender: ospid 352604
Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:26:50 2011 ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:04:24 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:04:25 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03113: end-of-file on communication channel
Fri Nov 25 10:10:13 2011 ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03135: connection lost contact
ORA-03135: connection lost contact
Fri Nov 25 10:12:23 2011 Shutting down instance (abort) -- abort 数据库
Fri Nov 25 10:12:23 2011 ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
Fri Nov 25 10:12:58 2011 Instance terminated by USER, pid = 304124
Fri Nov 25 10:27:52 2011 Starting ORACLE instance (normal) -- 正常启动数据库
Fri Nov 25 10:28:10 2011 lmon registered with NM - instance id 2 (internal mem no 1)
Fri Nov 25 10:28:12 2011 Reconfiguration started -- 第1次 reconfig List of nodes:0 1 2
Fri Nov 25 10:33:23 2011 IPC Send timeout detected.Sender: ospid 529138
Fri Nov 25 10:33:26 2011 ORA-27506: IPC error connecting to a port
Fri Nov 25 10:33:34 2011 ORA-00482: LMD* process terminated with error
Fri Nov 25 10:33:36 2011 Shutting down instance (abort) -- 第2次 abort 数据库
...............
...............
Fri Nov 25 10:34:47 2011 ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
Fri Nov 25 10:34:53 2011 Instance terminated by USER, pid = 446862
Fri Nov 25 10:49:39 2011 Starting ORACLE instance (normal) -- 第2次正常启动数据库
Fri Nov 25 10:49:57 2011 Reconfiguration started -- 第2次 reconfig List of nodes: 1 2
Fri Nov 25 10:56:39 2011 Instance terminated by USER, pid = 205414
Fri Nov 25 11:03:59 2011 Starting ORACLE instance (normal) -- 第3次正常启动数据库
Fri Nov 25 11:04:17 2011 Reconfiguration started -- 第3次 reconfig List of nodes: 0 1 2
zzzFri Nov 25 09:25:46 EAT 2011
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en3 1500 link#2 0.21.5e.6.8.37 56004124 0 948458 10 0
01:00:5e:00:00:01
en3 1500 192.168.0 192.168.0.3 56004124 0 948458 10 0
224.0.0.1
en0 1500 link#3 0.21.5e.6.8.f2 623180180 0 574030148 3 0
01:00:5e:00:00:01
en0 1500 10.10.10 10.10.10.12 623180180 0 574030148 3 0
224.0.0.1
en1 1500 link#4 0.21.5e.6.8.f3 0 0 1401015 1401015 0
01:00:5e:00:00:01
en1 1500 10.76.200 10.76.200.132 0 0 1401015 1401015 0
224.0.0.1
en2 1500 link#5 0.21.5e.6.8.36 435553601 0 119387543 3 0
01:00:5e:00:00:01
en2 1500 10.76.1 10.76.1.143 435553601 0 119387543 3 0
224.0.0.1
en2 1500 10.76.1 10.76.1.144 435553601 0 119387543 3 0
224.0.0.1
lo0 16896 link#1 14167657 0 14181993 0 0
lo0 16896 127 127.0.0.1 14167657 0 14181993 0 0
224.0.0.1
lo0 16896 ::1 14167657 0 14181993 0 0
ff01::1
ff02::1:ff00:1
ff02::1
icmp:
95502471 calls to icmp_error
0 errors not generated because old message was icmp
Output histogram:
echo reply: 4295876
destination unreachable: 226000
872 messages with bad code fields
0 messages < minimum length
12 bad checksums
0 messages with bad length
Input histogram:
echo reply: 1458209
destination unreachable: 56666
echo: 4295876
time exceeded: 191
4295876 message responses generated
igmp:
0 messages received
0 messages received with too few bytes
0 messages received with bad checksum
0 membership queries received
0 membership queries received with invalid field(s)
0 membership reports received
0 membership reports received with invalid field(s)
0 membership reports received for groups to which we belong
8 membership reports sent
tcp:
122743924 packets sent
116431735 data packets (2302053184 bytes)
25761 data packets (10003811 bytes) retransmitted
3553016 ack-only packets (3361721 delayed)
0 URG only packets
13 window probe packets
2423518 window update packets
309881 control packets
28566873 large sends
1403066833 bytes sent using largesend
65160 bytes is the biggest largesend
386118499 packets received
369946371 acks (for 2302274662 bytes)
74335 duplicate acks
0 acks for unsent data
98730125 packets (1777353999 bytes) received in-sequence
23016 completely duplicate packets (3194142 bytes)
0 old duplicate packets
20 packets with some dup. data (8594 bytes duped)
35986 out-of-order packets (2602936 bytes)
0 packets (0 bytes) of data after window
0 window probes
49947 window update packets
7 packets received after close
0 packets with bad hardware assisted checksum
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
13 discarded by listeners
0 discarded due to listener's queue full
283911590 ack packet headers correctly predicted
15339170 data packet headers correctly predicted
59734 connection requests
59629 connection accepts
83810 connections established (including accepts)
119948 connections closed (including 1831 drops)
0 connections with ECN capability
0 times responded to ECN
35446 embryonic connections dropped
351698548 segments updated rtt (of 112846079 attempts)
0 segments with congestion window reduced bit set
0 segments with congestion experienced bit set
0 resends due to path MTU discovery
36306 path MTU discovery terminations due to retransmits
203836 retransmit timeouts
439 connections dropped by rexmit timeout
189 fast retransmits
26 when congestion window less than 4 segments
134 newreno retransmits
27 times avoided false fast retransmits
30 persist timeouts
0 connections dropped due to persist timeout
46287 keepalive timeouts
9999 keepalive probes sent
36288 connections dropped by keepalive
0 times SACK blocks array is extended
0 times SACK holes array is extended
0 packets dropped due to memory allocation failure
0 connections in timewait reused
0 delayed ACKs for SYN
0 delayed ACKs for FIN
0 send_and_disconnects
0 spliced connections
0 spliced connections closed
0 spliced connections reset
0 spliced connections timeout
0 spliced connections persist timeout
0 spliced connections keepalive timeout
udp:
702526878 datagrams received
0 incomplete headers
3 bad data length fields
5 bad checksums
95502471 dropped due to no socket
47780240 broadcast/multicast datagrams dropped due to no socket
100987 socket buffer overflows <== Note here.
559143172 delivered
559613903 datagrams output
ip:
1106726708 total packets received
0 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with header length < data size
0 with data length < header length
0 with bad options
0 with incorrect version number
13553929 fragments received
9 fragments dropped (dup or out of space)
841 fragments dropped after timeout
3122140 packets reassembled ok
1094399939 packets for this host
56393 packets for unknown/unsupported protocol
0 packets forwarded
207190 packets not forwardable
0 redirects sent
693606490 packets sent from this host
991 packets sent with fabricated ip header
0 output packets dropped due to no bufs, etc.
0 output packets discarded due to no route
4245736 output datagrams fragmented
20081318 fragments created
396 datagrams that can't be fragmented
1631395 IP Multicast packets dropped due to no receiver
0 successful path MTU discovery cycles
0 path MTU rediscovery cycles attempted
0 path MTU discovery no-response estimates
0 path MTU discovery response timeouts
0 path MTU discovery decreases detected
0 path MTU discovery packets sent
0 path MTU discovery memory allocation failures
0 ipintrq overflows
2 with illegal source
0 packets processed by threads
0 packets dropped by threads
14925 packets dropped due to the full socket receive buffer
0 dead gateway detection packets sent
0 dead gateway detection packet allocation failures
0 dead gateway detection gateway allocation failures
ipv6:
0 total packets received
0 with size smaller than minimum
0 with data size < data length
0 with incorrect version number
0 with illegal source
0 input packets without enough memory
0 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
0 packets reassembled ok
0 packets for this host
0 packets for unknown/unsupported protocol
0 packets forwarded
0 packets not forwardable
0 too big packets not forwarded
0 packets sent from this host
0 packets sent with fabricated ipv6 header
0 output packets dropped due to no bufs, etc.
0 output packets without enough memory
0 output packets discarded due to no route
0 output datagrams fragmented
0 fragments created
0 packets dropped due to the full socket receive buffer
0 packets not delivered due to bad raw IPv6 checksum
icmpv6:
0 calls to icmp6_error
0 errors not generated because old message was icmpv6
Output histogram:
unreachable: 0
packets too big: 0
time exceeded: 0
parameter problems: 0
redirects: 0
echo requests: 0
echo replies: 0
group queries: 0
group reports: 0
group terminations: 0
router solicitations: 0
router advertisements: 0
neighbor solicitations: 0
neighbor advertisements: 0
0 messages with bad code fields
0 messages < minimum length
0 bad checksums
0 messages with bad length
Input histogram:
unreachable: 0
packets too big: 0
time exceeded: 0
parameter problems: 0
echo requests: 0
echo replies: 0
group queries: 0
bad group queries: 0
group reports: 0
bad group reports: 0
our groups' reports: 0
group terminations: 0
bad group terminations: 0
router solicitations: 0
bad router solicitations: 0
router advertisements: 0
bad router advertisements: 0
neighbor solicitations: 0
bad neighbor solicitations: 0
neighbor advertisements: 0
bad neighbor advertisements: 0
redirects: 0
bad redirects: 0
mobility calls when not started: 0
home agent address discovery requests: 0
bad home agent address discovery requests: 0
home agent address discovery replies: 0
bad home agent address discovery replies: 0
prefix solicitations: 0
bad prefix solicitations: 0
prefix advertisements: 0
bad prefix advertisements: 0
0 message responses generated
*** 2011-11-25 09:26:38.408
IPC Send timeout detected.Sender: ospid 258258
Receiver: inst 3 binc -981855203 ospid 340396
SKGXPCTX: 0x11053d158 ctx
wait delta 0 sec (14 msec) ctx ts 0x4b421 last ts 0x4b422
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 1
timed wait receives 0
admno 0x3c67bf54 admport:
SSKGXPT 0x1053d4a4 flags active network 0
info for network 0
socket no 11 IP 10.10.10.10 UDP 36952
HACMP network_id 0 sflags SSKGXPT_UP
context timestamp 0x4b421
buffers queued on port 11057a888
sconno accono ertt state seq# sent async sync rtrans acks
ach accono sconno admno state seq# rcv rtrans acks
0x1105d9050 0x2c61e946 0x48f33837 0x1368ad14 40 32764 1 0 1
Submitting synchronized dump request [268435460]
*** 2011-11-25 09:27:05.144
ORA-27506: IPC error connecting to a port
Assertion failed: 0, file skgxpu.c, line 8568
*** 2011-11-25 09:27:05.144
NOTICE: Process received signal SIGLOST. Therefore, the call
stack does not point to the location of the problem.
Take a look at /var/adm/messages to find why this
process got SIGLOST
Exception signal: 6 0x9000000002832b4 (pthread_kill+0080) e8410028
Registers:
#if defined (SIGLOST) /* resource lost (eg, record-lock lost) */
signal_names[SIGLOST] = "SIGLOST";
#endif
###### sndb3 的lmon trace如下 ######
*** 2011-11-25 10:12:20.202
kjfcrfg: DRM window size = 2048->2048 (min lognb = 14)
*** 2011-11-25 10:12:20.210
Reconfiguration started (old inc 6, new inc 8)
Synchronization timeout interval: 1020 sec
List of nodes:
0 2
*** 2011-11-25 10:12:20.213
Global Resource Directory frozen
node 0
node 2
asby init, 0/0/x2
asby returns, 0/0/x2/false
* Domain maps before reconfiguration:
* DOMAIN 0 (valid 1): 0 1 2
* End of domain mappings
* kjbdomrcfg2: domain 0 invalid = TRUE
* Domain maps after recomputation:
* DOMAIN 0 (valid 0): 0 2
* End of domain mappings
Dead inst 1
Join inst
Exist inst 0 2
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
Active Sendback Threshold = 50 %
Communication channels reestablished
sent syncr inc 8 lvl 1 to 0 (8,5/0/0)
synca inc 8 lvl 1 rcvd (8.5)
replayed dom info (8.6)
received master 0 (8.6)
sent syncr inc 8 lvl 2 to 0 (8,7/0/0)
synca inc 8 lvl 2 rcvd (8.7)
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Set master node info
sent syncr inc 8 lvl 3 to 0 (8,13/0/0)
synca inc 8 lvl 3 rcvd (8.13)
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 2204
sent syncr inc 8 lvl 4 to 0 (8,15/0/0)
synca inc 8 lvl 4 rcvd (8.15)
Dwn-cvts replayed, VALBLKs dubious
sent syncr inc 8 lvl 5 to 0 (8,18/0/0)
synca inc 8 lvl 5 rcvd (8.18)
All grantable enqueues granted
sent syncr inc 8 lvl 6 to 0 (8,20/0/0)
synca inc 8 lvl 6 rcvd (8.20)
*** 2011-11-25 10:12:22.061
Submitted all GCS cache requests
sent syncr inc 8 lvl 7 to 0 (8,22/0/0)
synca inc 8 lvl 7 rcvd (8.22)
Fix write in gcs resources
sent syncr inc 8 lvl 8 to 0 (8,24/0/0)
synca inc 8 lvl 8 rcvd (8.24)
*** 2011-11-25 10:12:23.459
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11 UDP port number 61800
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11 UDP port number 61800
*** 2011-11-25 09:26:20.449
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10 UDP port number 36992
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10 UDP port number 33548
*** 2011-11-25 09:27:15.598
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 275164, image:
*** 2011-11-25 09:27:15.598
另外通过oraadebug ipc,我们也能看到rac心跳确实是使用的udp协议,如下:
Shared Memory:
ID KEY
10485768 0xe1892820
Maximum processes: = 750
ksxpdmp: facility 0 (?) (0x1, 0x0) counts 0, 0
ksxpdmp: Dumping the osd context
SKGXPCTX: 0x1105a4c78 ctx
wait delta 13 sec (13149 msec) ctx ts 0xb last ts 0xb
user cpu time since last wait -171798692 sec -1 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 0
timed wait receives 0
admno 0x104b9dec admport:
SSKGXPT 0x105a4fc4 flags SSKGXPT_READPENDING active network 0
info for network 0
socket no 7 IP 10.10.10.12 UDP 55895 <== Note here.
HACMP network_id 0 sflags SSKGXPT_UP
context timestamp 0xb
no ports
sconno accono ertt state seq# sent async sync rtrans acks
0x3115c6c7 0x6e0d721f 32 3 32765 2 2 0 0 2
另外os方面也没有发现有价值的信息,故障期间的 awr top5 如下:
关于ges/gcs enter server mode 这两个等待事件表现正常,表现为该实例在进行recover,smaon进程表现出的等待为该event。
下面是google到的解释:
In a RAC instance the during the instance recovery, first pass scan can be delayed by 300ms-1.5s waiting
on GRD (Global resource directory). During that time SMAON enters the wait event which is called "ges enter server mode" .
TOYS_SEQ_ZF 下存在SEQUENCE,通过如下方式进行修改:
ALTERSEQUENCE owner.sequence_name cache 1000;
检查SEQUENCE cache值:
SETLINES200SET pagesize 1000
col owner FOR a20
col SEQUENCE_NAME FOR a30
SELECT owner,sequence_name,MAX_VALUE,INCREMENT_BY,CYCLE_FLAG,ORDER_FLAG,CACHE_SIZE
FROM dba_sequence;