[Case Study] Oracle RAC node hang caused by "IPC send timeout error" and its resolution

天萃荷净

Oracle Research Center case analysis: an operations DBA reported that nodes in an Oracle RAC environment hung; analysis of the logs showed that an "IPC send timeout error" caused the RAC node hang.

Unless marked as a reprint, articles on this site are original. Reprinted from love wife & love life — Roger's Oracle technical blog.

Permalink: "IPC send timeout error" causing a RAC node hang

A 3-node RAC managed by a colleague failed last week: one of the nodes hit an IPC send timeout, which left the other two nodes hung and the application unable to connect; in the end the nodes could only be restarted.

A simple timeline assembled from the alert logs:

###### sndb3 ######

Fri Nov 25 09:26:08 2011   IPC Send timeout detected. Receiver ospid 340396   -- sndb3

..........

Fri Nov 25 09:28:54 2011   IPC Send timeout detected. Receiver ospid 340396   -- sndb3

Fri Nov 25 10:04:19 2011   Trace dumping is performing id=[cdmp_20111125100426]

Fri Nov 25 10:10:07 2011   Warning: cluster_database_instances (2) of node 0 is <= my node id (2)

Fri Nov 25 10:12:20 2011   Reconfiguration started  -- 1st reconfig; List of nodes: 0 2

Fri Nov 25 10:28:05 2011   Reconfiguration started  -- 2nd reconfig; List of nodes: 0 1 2

Fri Nov 25 10:49:03 2011   Reconfiguration started  -- 3rd reconfig; List of nodes: 2

Fri Nov 25 10:49:50 2011   Reconfiguration started  -- 4th reconfig; List of nodes: 1 2

Fri Nov 25 10:55:17 2011   Reconfiguration started  -- 5th reconfig; List of nodes: 2

Fri Nov 25 10:55:18 2011   Reconfiguration complete

Fri Nov 25 10:55:19 2011   Shutting down instance (abort)  -- instance aborted

Fri Nov 25 10:56:47 2011   Starting ORACLE instance -- normal instance startup

###### sndb1 ######

Fri Nov 25 09:26:29 2011   IPC Send timeout detected.Sender: ospid 352952

Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 09:26:32 2011   ORA-27506: IPC error connecting to a port

Fri Nov 25 09:26:33 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 09:26:38 2011   IPC Send timeout detected.Sender: ospid 258258

Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 09:27:05 2011   ORA-27506: IPC error connecting to a port

ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 09:42:09 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

ORA-03135: connection lost contact

ORA-03135: connection lost contact

Fri Nov 25 10:12:41 2011   Reconfiguration started  -- 1st reconfig; List of nodes: 0 2

Fri Nov 25 10:22:55 2011   IPC Send timeout detected.Sender: ospid 127746

Receiver: inst 3 binc -981855203 ospid 340396

.............

.............

Fri Nov 25 10:22:58 2011   IPC Send timeout detected.Sender: ospid 299594

Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 10:22:58 2011   ORA-27506: IPC error connecting to a port

Fri Nov 25 10:22:59 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

.............

Fri Nov 25 10:23:25 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:28:26 2011   Reconfiguration started -- 2nd reconfig; List of nodes: 0 1 2

Fri Nov 25 10:33:55 2011   Reconfiguration complete

Fri Nov 25 10:39:14 2011   IPC Send timeout detected.Sender: ospid 377580

Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 10:39:18 2011   ORA-27506: IPC error connecting to a port

ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:49:19 2011   ORA-29746: Cluster Synchronization Service is being shut down.

Fri Nov 25 10:49:20 2011   ORA-27300: OS system dependent operation:invalid_process failed with status: 0

ORA-27301: OS failure message: Error 0

ORA-27302: failure occurred at: skgpstime3

ORA-27144: attempt to kill process failed

ORA-29746: Cluster Synchronization Service is being shut down.

Instance terminated by LMON, pid = 365348

Fri Nov 25 11:02:12 2011   Starting ORACLE instance (normal) -- normal startup of instance sndb1

###### sndb2 ######

Fri Nov 25 09:26:20 2011   Trace dumping is performing id=[cdmp_20111125092634]

Fri Nov 25 09:26:24 2011   IPC Send timeout detected.Sender: ospid 352604

Receiver: inst 3 binc -981855203 ospid 340396

Fri Nov 25 09:26:50 2011   ORA-27506: IPC error connecting to a port

ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:04:24 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

Fri Nov 25 10:04:25 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

ORA-03113: end-of-file on communication channel

Fri Nov 25 10:10:13 2011   ORA-07445: exception encountered: core dump [] [] [] [] [] []

ORA-03135: connection lost contact

ORA-03135: connection lost contact

Fri Nov 25 10:12:23 2011   Shutting down instance (abort)  -- instance aborted

Fri Nov 25 10:12:23 2011   ORA-27300: OS system dependent operation:invalid_process failed with status: 0

ORA-27301: OS failure message: Error 0

ORA-27302: failure occurred at: skgpstime3

ORA-27144: attempt to kill process failed

Fri Nov 25 10:12:58 2011   Instance terminated by USER, pid = 304124

Fri Nov 25 10:27:52 2011   Starting ORACLE instance (normal) -- normal instance startup

Fri Nov 25 10:28:10 2011   lmon registered with NM - instance id 2 (internal mem no 1)

Fri Nov 25 10:28:12 2011   Reconfiguration started  -- 1st reconfig; List of nodes: 0 1 2

Fri Nov 25 10:33:23 2011   IPC Send timeout detected.Sender: ospid 529138

Fri Nov 25 10:33:26 2011   ORA-27506: IPC error connecting to a port

Fri Nov 25 10:33:34 2011   ORA-00482: LMD* process terminated with error

Fri Nov 25 10:33:36 2011   Shutting down instance (abort)  -- 2nd instance abort

...............

...............

Fri Nov 25 10:34:47 2011   ORA-27300: OS system dependent operation:invalid_process failed with status: 0

ORA-27301: OS failure message: Error 0

ORA-27302: failure occurred at: skgpstime3

ORA-27144: attempt to kill process failed

Fri Nov 25 10:34:53 2011   Instance terminated by USER, pid = 446862

Fri Nov 25 10:49:39 2011   Starting ORACLE instance (normal) -- 2nd normal startup

Fri Nov 25 10:49:57 2011   Reconfiguration started -- 2nd reconfig; List of nodes: 1 2

Fri Nov 25 10:56:39 2011   Instance terminated by USER, pid = 205414

Fri Nov 25 11:03:59 2011   Starting ORACLE instance (normal) -- 3rd normal startup

Fri Nov 25 11:04:17 2011   Reconfiguration started -- 3rd reconfig; List of nodes: 0 1 2

The overall sequence is as laid out above. According to my colleague, between 9 and 10 o'clock on the 25th this RAC reported IPC send timeout errors, leaving nodes 1 and 2 hung; his direct description was that the application could no longer connect to nodes 1 and 2 at all, and in the end he forcibly stopped both nodes with srvctl.

As we know, an IPC timeout that goes on to hang the system can have several possible causes:

1. Very high CPU load

2. Insufficient memory

3. Network problems, such as interconnect (heartbeat) anomalies

4. OS parameter settings, e.g. UDP/TCP-related parameters set too small

5. CRS and DB version mismatch (I am not sure about this one; I added it because here CRS is 10.2.0.4 while the DB is 10.2.0.1)

Now let's rule these out one by one:

The OSWatcher (osw) data from node sndb3 around the failure window:

zzz ***Fri Nov 25 09:25:46 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu

----- ----------- ------------------------ ------------ -----------

r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa

0  0 4002889 11233515   0   0   0   0    0   0 6238 240716 20451 11  6 81  2

2  0 4002865 11233537   0   0   0   0    0   0 6215 240551 21467 10  6 82  2

2  0 4002895 11233502   0   0   0   0    0   0 5594 239027 19584 11  5 82  2

zzz ***Fri Nov 25 09:26:16 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu

----- ----------- ------------------------ ------------ -----------

r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa

0  1 4002579 11229653   0   0   0   0    0   0 1950 31763 8699  4  2 88  5

0  1 4002657 11229073   0   0   0   0    0   0 2325 27718 9141  3  2 90  5

2  2 4002661 11228588   0   0   0   0    0   0 1891 22869 8332  4  1 90  5

zzz ***Fri Nov 25 09:26:46 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu

----- ----------- ------------------------ ------------ -----------

r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa

1  0 4009439 11223429   0   0   0   0    0   0 820 21974 4532  4  2 95  0

1  0 4009445 11223419   0   0   0   0    0   0 797 13064 4497  3  1 96  0

1  0 4009445 11223416   0   0   0   0    0   0 621 11103 3988  3  1 96  0

zzz ***Fri Nov 25 09:27:16 EAT 2011

System configuration: lcpu=16 mem=63936MB

kthr    memory              page              faults        cpu

----- ----------- ------------------------ ------------ -----------

r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa

3  1 4012753 11217337   0   0   0   0    0   0 303 229232 3042  6 20 73  0

2  0 4012780 11216595   0   0   0   0    0   0 612 107627 3761  5  9 86  0

2  0 4012780 11216592   0   0   0   0    0   0 770 12774 4312  3  1 96  0

As shown above, when the IPC send timeouts appeared, both CPU and memory on sndb3 were largely idle. Next, check the network:

zzzFri Nov 25 09:25:46 EAT 2011

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en3   1500  link#2      0.21.5e.6.8.37    56004124     0   948458    10     0

01:00:5e:00:00:01

en3   1500  192.168.0   192.168.0.3       56004124     0   948458    10     0

224.0.0.1

en0   1500  link#3      0.21.5e.6.8.f2    623180180     0 574030148     3     0

01:00:5e:00:00:01

en0   1500  10.10.10    10.10.10.12       623180180     0 574030148     3     0

224.0.0.1

en1   1500  link#4      0.21.5e.6.8.f3           0     0  1401015 1401015     0

01:00:5e:00:00:01

en1   1500  10.76.200   10.76.200.132            0     0  1401015 1401015     0

224.0.0.1

en2   1500  link#5      0.21.5e.6.8.36    435553601     0 119387543     3     0

01:00:5e:00:00:01

en2   1500  10.76.1     10.76.1.143       435553601     0 119387543     3     0

224.0.0.1

en2   1500  10.76.1     10.76.1.144       435553601     0 119387543     3     0

224.0.0.1

lo0   16896 link#1                        14167657     0 14181993     0     0

lo0   16896 127         127.0.0.1         14167657     0 14181993     0     0

224.0.0.1

lo0   16896 ::1                           14167657     0 14181993     0     0

ff01::1

ff02::1:ff00:1

ff02::1

icmp:

95502471 calls to icmp_error

0 errors not generated because old message was icmp

Output histogram:

echo reply: 4295876

destination unreachable: 226000

872 messages with bad code fields

0 messages < minimum length

12 bad checksums

0 messages with bad length

Input histogram:

echo reply: 1458209

destination unreachable: 56666

echo: 4295876

time exceeded: 191

4295876 message responses generated

igmp:

0 messages received

0 messages received with too few bytes

0 messages received with bad checksum

0 membership queries received

0 membership queries received with invalid field(s)

0 membership reports received

0 membership reports received with invalid field(s)

0 membership reports received for groups to which we belong

8 membership reports sent

tcp:

122743924 packets sent

116431735 data packets (2302053184 bytes)

25761 data packets (10003811 bytes) retransmitted

3553016 ack-only packets (3361721 delayed)

0 URG only packets

13 window probe packets

2423518 window update packets

309881 control packets

28566873 large sends

1403066833 bytes sent using largesend

65160 bytes is the biggest largesend

386118499 packets received

369946371 acks (for 2302274662 bytes)

74335 duplicate acks

0 acks for unsent data

98730125 packets (1777353999 bytes) received in-sequence

23016 completely duplicate packets (3194142 bytes)

0 old duplicate packets

20 packets with some dup. data (8594 bytes duped)

35986 out-of-order packets (2602936 bytes)

0 packets (0 bytes) of data after window

0 window probes

49947 window update packets

7 packets received after close

0 packets with bad hardware assisted checksum

0 discarded for bad checksums

0 discarded for bad header offset fields

0 discarded because packet too short

13 discarded by listeners

0 discarded due to listener's queue full

283911590 ack packet headers correctly predicted

15339170 data packet headers correctly predicted

59734 connection requests

59629 connection accepts

83810 connections established (including accepts)

119948 connections closed (including 1831 drops)

0 connections with ECN capability

0 times responded to ECN

35446 embryonic connections dropped

351698548 segments updated rtt (of 112846079 attempts)

0 segments with congestion window reduced bit set

0 segments with congestion experienced bit set

0 resends due to path MTU discovery

36306 path MTU discovery terminations due to retransmits

203836 retransmit timeouts

439 connections dropped by rexmit timeout

189 fast retransmits

26 when congestion window less than 4 segments

134 newreno retransmits

27 times avoided false fast retransmits

30 persist timeouts

0 connections dropped due to persist timeout

46287 keepalive timeouts

9999 keepalive probes sent

36288 connections dropped by keepalive

0 times SACK blocks array is extended

0 times SACK holes array is extended

0 packets dropped due to memory allocation failure

0 connections in timewait reused

0 delayed ACKs for SYN

0 delayed ACKs for FIN

0 send_and_disconnects

0 spliced connections

0 spliced connections closed

0 spliced connections reset

0 spliced connections timeout

0 spliced connections persist timeout

0 spliced connections keepalive timeout

udp:

702526878 datagrams received

0 incomplete headers

3 bad data length fields

5 bad checksums

95502471 dropped due to no socket

47780240 broadcast/multicast datagrams dropped due to no socket

100987 socket buffer overflows <== Note here.

559143172 delivered

559613903 datagrams output

ip:

1106726708 total packets received

0 bad header checksums

0 with size smaller than minimum

0 with data size < data length

0 with header length < data size

0 with data length < header length

0 with bad options

0 with incorrect version number

13553929 fragments received

9 fragments dropped (dup or out of space)

841 fragments dropped after timeout

3122140 packets reassembled ok

1094399939 packets for this host

56393 packets for unknown/unsupported protocol

0 packets forwarded

207190 packets not forwardable

0 redirects sent

693606490 packets sent from this host

991 packets sent with fabricated ip header

0 output packets dropped due to no bufs, etc.

0 output packets discarded due to no route

4245736 output datagrams fragmented

20081318 fragments created

396 datagrams that can't be fragmented

1631395 IP Multicast packets dropped due to no receiver

0 successful path MTU discovery cycles

0 path MTU rediscovery cycles attempted

0 path MTU discovery no-response estimates

0 path MTU discovery response timeouts

0 path MTU discovery decreases detected

0 path MTU discovery packets sent

0 path MTU discovery memory allocation failures

0 ipintrq overflows

2 with illegal source

0 packets processed by threads

0 packets dropped by threads

14925 packets dropped due to the full socket receive buffer

0 dead gateway detection packets sent

0 dead gateway detection packet allocation failures

0 dead gateway detection gateway allocation failures

ipv6:

0 total packets received

0 with size smaller than minimum

0 with data size < data length

0 with incorrect version number

0 with illegal source

0 input packets without enough memory

0 fragments received

0 fragments dropped (dup or out of space)

0 fragments dropped after timeout

0 packets reassembled ok

0 packets for this host

0 packets for unknown/unsupported protocol

0 packets forwarded

0 packets not forwardable

0 too big packets not forwarded

0 packets sent from this host

0 packets sent with fabricated ipv6 header

0 output packets dropped due to no bufs, etc.

0 output packets without enough memory

0 output packets discarded due to no route

0 output datagrams fragmented

0 fragments created

0 packets dropped due to the full socket receive buffer

0 packets not delivered due to bad raw IPv6 checksum

icmpv6:

0 calls to icmp6_error

0 errors not generated because old message was icmpv6

Output histogram:

unreachable: 0

packets too big: 0

time exceeded: 0

parameter problems: 0

redirects: 0

echo requests: 0

echo replies: 0

group queries: 0

group reports: 0

group terminations: 0

router solicitations: 0

router advertisements: 0

neighbor solicitations: 0

neighbor advertisements: 0

0 messages with bad code fields

0 messages < minimum length

0 bad checksums

0 messages with bad length

Input histogram:

unreachable: 0

packets too big: 0

time exceeded: 0

parameter problems: 0

echo requests: 0

echo replies: 0

group queries: 0

bad group queries: 0

group reports: 0

bad group reports: 0

our groups' reports: 0

group terminations: 0

bad group terminations: 0

router solicitations: 0

bad router solicitations: 0

router advertisements: 0

bad router advertisements: 0

neighbor solicitations: 0

bad neighbor solicitations: 0

neighbor advertisements: 0

bad neighbor advertisements: 0

redirects: 0

bad redirects: 0

mobility calls when not started: 0

home agent address discovery requests: 0

bad home agent address discovery requests: 0

home agent address discovery replies: 0

bad home agent address discovery replies: 0

prefix solicitations: 0

bad prefix solicitations: 0

prefix advertisements: 0

bad prefix advertisements: 0

0 message responses generated

The "100987 socket buffer overflows" line above indicates packets overflowing the socket buffers, which suggests the UDP-related parameters may be set too small.
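One way to tell whether this counter is still growing (as opposed to being historical) is to diff two `netstat -s` snapshots a few minutes apart. A minimal sketch; the snapshot text below is sample data, not taken from the real host:

```python
import re

def udp_socket_buffer_overflows(netstat_s_output: str) -> int:
    """Extract the UDP 'socket buffer overflows' counter from `netstat -s` text."""
    in_udp = False
    for line in netstat_s_output.splitlines():
        if re.match(r"^\S+:", line):      # start of a protocol section, e.g. "udp:"
            in_udp = line.strip() == "udp:"
        elif in_udp:
            m = re.search(r"(\d+) socket buffer overflows", line)
            if m:
                return int(m.group(1))
    return 0

# Two hypothetical snapshots taken a few minutes apart:
snap1 = """tcp:
        122743924 packets sent
udp:
        702526878 datagrams received
        100987 socket buffer overflows
        559143172 delivered
"""
snap2 = snap1.replace("100987", "101450")

delta = udp_socket_buffer_overflows(snap2) - udp_socket_buffer_overflows(snap1)
print(delta)  # 463; a steadily growing delta points at an undersized receive buffer
```

A delta that keeps climbing during load is a stronger signal than the absolute counter value, since the counter accumulates since boot.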

Next, analyze the CRS logs and the traces of key processes such as LMD and LMON:

*** 2011-11-25 09:26:38.408

IPC Send timeout detected.Sender: ospid 258258

Receiver: inst 3 binc -981855203 ospid 340396

SKGXPCTX: 0x11053d158 ctx

wait delta 0 sec (14 msec) ctx ts 0x4b421 last ts 0x4b422

user cpu time since last wait 0 sec 0 ticks

system cpu time since last wait 0 sec 0 ticks

locked 1

blocked 1

timed wait receives 0

admno 0x3c67bf54 admport:

SSKGXPT 0x1053d4a4 flags  active network 0

info for network 0

socket no 11  IP 10.10.10.10  UDP 36952

HACMP network_id 0 sflags SSKGXPT_UP

context timestamp 0x4b421

buffers queued on port 11057a888

sconno     accono   ertt  state   seq#   sent  async   sync rtrans   acks

ach     accono     sconno      admno  state   seq#    rcv rtrans   acks

0x1105d9050 0x2c61e946 0x48f33837 0x1368ad14     40  32764      1      0      1

Submitting synchronized dump request [268435460]

*** 2011-11-25 09:27:05.144

ORA-27506: IPC error connecting to a port

Assertion failed: 0, file  skgxpu.c, line 8568

*** 2011-11-25 09:27:05.144

NOTICE: Process received signal SIGLOST. Therefore, the call

stack does not point to the location of the problem.

Take a look at /var/adm/messages to find why this

process got SIGLOST

Exception signal: 6 0x9000000002832b4 (pthread_kill+0080) e8410028

Registers:

#if defined (SIGLOST) /* resource lost (eg, record-lock lost) */

signal_names[SIGLOST] = "SIGLOST";

#endif

###### LMON trace on sndb3 ######

*** 2011-11-25 10:12:20.202

kjfcrfg: DRM window size = 2048->2048 (min lognb = 14)

*** 2011-11-25 10:12:20.210

Reconfiguration started (old inc 6, new inc 8)

Synchronization timeout interval: 1020 sec

List of nodes:

0 2

*** 2011-11-25 10:12:20.213

Global Resource Directory frozen

node 0

node 2

asby init, 0/0/x2

asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

*   DOMAIN 0 (valid 1): 0 1 2

* End of domain mappings

* kjbdomrcfg2: domain 0 invalid = TRUE

* Domain maps after recomputation:

*   DOMAIN 0 (valid 0): 0 2

* End of domain mappings

Dead  inst 1

Join  inst

Exist inst 0 2

KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6

KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6

KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6

KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6

KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6

Active Sendback Threshold = 50 %

Communication channels reestablished

sent syncr inc 8 lvl 1 to 0 (8,5/0/0)

synca inc 8 lvl 1 rcvd (8.5)

replayed dom info (8.6)

received master 0 (8.6)

sent syncr inc 8 lvl 2 to 0 (8,7/0/0)

synca inc 8 lvl 2 rcvd (8.7)

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Set master node info

sent syncr inc 8 lvl 3 to 0 (8,13/0/0)

synca inc 8 lvl 3 rcvd (8.13)

Submitted all remote-enqueue requests

kjfcrfg: Number of mesgs sent to node 0 = 2204

sent syncr inc 8 lvl 4 to 0 (8,15/0/0)

synca inc 8 lvl 4 rcvd (8.15)

Dwn-cvts replayed, VALBLKs dubious

sent syncr inc 8 lvl 5 to 0 (8,18/0/0)

synca inc 8 lvl 5 rcvd (8.18)

All grantable enqueues granted

sent syncr inc 8 lvl 6 to 0 (8,20/0/0)

synca inc 8 lvl 6 rcvd (8.20)

*** 2011-11-25 10:12:22.061

Submitted all GCS cache requests

sent syncr inc 8 lvl 7 to 0 (8,22/0/0)

synca inc 8 lvl 7 rcvd (8.22)

Fix write in gcs resources

sent syncr inc 8 lvl 8 to 0 (8,24/0/0)

synca inc 8 lvl 8 rcvd (8.24)

*** 2011-11-25 10:12:23.459

The crsd and ocssd logs and the LMON/LMD traces on the other two nodes contained nothing of value either; in fact there was no output at all between 9:26 and 10:12.

Only the LMD trace contained something of interest:

SKGXPDOCON:connection refused: out of  memory

SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11  UDP port number 61800

SKGXPDOCON:connection refused: out of  memory

SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11  UDP port number 61800

*** 2011-11-25 09:26:20.449

SKGXPDOCON:connection refused: out of  memory

SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10  UDP port number 36992

SKGXPDOCON:connection refused: out of  memory

SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10  UDP port number 33548

*** 2011-11-25 09:27:15.598

Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 275164, image:

*** 2011-11-25 09:27:15.598

In addition, oradebug ipc also confirms that the RAC interconnect is indeed running over UDP, as shown below:

Shared Memory:

ID  KEY

10485768 0xe1892820

Maximum processes:               = 750

ksxpdmp:  facility 0 ( ) (0x1, 0x0) counts 0, 0

ksxpdmp:  Dumping the osd context

SKGXPCTX: 0x1105a4c78 ctx

wait delta 13 sec (13149 msec) ctx ts 0xb last ts 0xb

user cpu time since last wait -171798692 sec -1 ticks

system cpu time since last wait 0 sec 0 ticks

locked 1

blocked 0

timed wait receives 0

admno 0x104b9dec admport:

SSKGXPT 0x105a4fc4 flags SSKGXPT_READPENDING active network 0

info for network 0

socket no 7  IP 10.10.10.12  UDP 55895 <== Note here.

HACMP network_id 0 sflags SSKGXPT_UP

context timestamp 0xb

no ports

sconno     accono   ertt  state   seq#   sent  async   sync rtrans   acks

0x3115c6c7 0x6e0d721f     32      3  32765      2      2      0      0      2

Nothing of value was found on the OS side either. The AWR Top 5 during the failure window is shown below:

[Figure: AWR Top 5 Timed Events during the failure window]

The "ges enter server mode" and "gcs enter server mode" wait events here are normal: the instance was performing recovery, and SMON shows this wait event while doing so.

An explanation found via Google:

In a RAC instance, during instance recovery the first-pass scan can be delayed by 300 ms to 1.5 s waiting on the GRD (Global Resource Directory). During that time SMON enters the wait event called "ges enter server mode".

Another point worth noting is that the SQ-related waits are rather high.

The "Global Cache and Enqueue Services - Messaging Statistics" section likewise shows that cluster interconnect traffic is considerable.

A related note was also found on MOS, although it does not appear to match this case:

IPC Send timeout detected and Instance eviction on RDBMS and ASM 10.2.0.1 RAC [ID 370182.1]

The final email reply was as follows:

Setting this parameter too small can cause RAC nodes to hang, so the second possibility is considered the more likely cause.

The current values on this system:

udp_recvspace = 655360

udp_sendspace = 65536

Per the official documentation, the recommended changes are:

udp_recvspace = 2621440 (this parameter may be left unchanged)

udp_sendspace = 262144

The official recommendation for udp_sendspace is (DB_BLOCK_SIZE * DB_MULTIBLOCK_READ_COUNT) + 4 KB.

After adjusting the parameters, monitor the system for a period of time.
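As a quick sanity check of that formula, the block size and multiblock read count below are assumed typical values, not taken from this system:

```python
def recommended_udp_sendspace(db_block_size: int, multiblock_read_count: int) -> int:
    """(DB_BLOCK_SIZE * DB_MULTIBLOCK_READ_COUNT) + 4 KB, per the formula above."""
    return db_block_size * multiblock_read_count + 4 * 1024

# e.g. an 8 KB block size with a multiblock read count of 16:
print(recommended_udp_sendspace(8192, 16))  # 135168 bytes
```

So for common block-size settings the suggested 262144 leaves comfortable headroom above the formula's minimum.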

Addendum: the AWR Top 5 shows relatively high "enq: SQ - contention" waits, so increasing the sequence cache value is recommended (the default of 20 is small).

Sequences exist under TOYS_SEQ_ZF; modify them as follows:

ALTER SEQUENCE owner.sequence_name cache 1000;

To check sequence cache values:

SET LINES 200

SET pagesize 1000

col owner FOR a20

col SEQUENCE_NAME FOR a30

SELECT owner,sequence_name,MAX_VALUE,INCREMENT_BY,CYCLE_FLAG,ORDER_FLAG,CACHE_SIZE

FROM dba_sequences;
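To apply the cache change across many sequences, the ALTER statements could be generated from rows of the dictionary query above. A minimal sketch; the owner, sequence names, and thresholds below are hypothetical:

```python
def alter_cache_statements(rows, new_cache=1000, small_cache=20):
    """Build ALTER SEQUENCE statements for sequences whose CACHE_SIZE is <= small_cache.

    rows: iterable of (owner, sequence_name, cache_size) tuples,
    as they might be fetched from dba_sequences.
    """
    return [
        f"ALTER SEQUENCE {owner}.{name} CACHE {new_cache};"
        for owner, name, cache in rows
        if cache <= small_cache
    ]

# Hypothetical rows:
rows = [
    ("TOYS_SEQ_ZF", "SEQ_ORDER_ID", 20),
    ("TOYS_SEQ_ZF", "SEQ_AUDIT_ID", 1000),
]
for stmt in alter_cache_statements(rows):
    print(stmt)  # only the sequence still at the default cache of 20 qualifies
```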
