OSW PRVTNET log from node 2:
zzz ***Tue Mar 25 13:14:39 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 * dbrac1-priv (192.168.100.1) 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:14:51 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 * dbrac1-priv (192.168.100.1) 0 ms 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Tue Mar 25 13:14:56 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 * * dbrac1-priv (192.168.100.1) 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Warning. Traceroute response is spanning snapshot intervals.
zzz ***Tue Mar 25 13:15:08 BEIST 2014
trying to get source for 192.168.100.1
source should be 192.168.100.2
traceroute to 192.168.100.1 (192.168.100.1) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac1-priv (192.168.100.1) 1 ms * 0 ms
trying to get source for 192.168.100.2
source should be 192.168.100.2
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.2 (192.168.100.2), 30 hops max
outgoing MTU = 1500
1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms
zzz ***Warning. Traceroute response is spanning snapshot intervals.
Analysis:
An IPC Send timeout occurred between instance 2 and instance 1, after which instance 2 was evicted from the cluster.
The traceroute output shows that private-network communication between the nodes was failing at the time:
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * * <===============================Note: each * marks a traceroute probe that got no reply; three *'s mean all three probes sent for that hop failed
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 * * *
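The failing-hop pattern above can be spotted mechanically. The following is a minimal sketch (a hypothetical helper, not part of OSWatcher) that parses traceroute output and flags any hop whose probes were all unanswered:

```python
def failed_hops(traceroute_output: str) -> list[int]:
    """Return hop numbers whose probes were all unanswered ('*')."""
    failed = []
    for line in traceroute_output.splitlines():
        fields = line.split()
        # Hop lines start with the hop number; skip header lines
        # such as "traceroute to ..." and "outgoing MTU = 1500".
        if not fields or not fields[0].isdigit():
            continue
        hop, probes = int(fields[0]), fields[1:]
        if probes and all(p == "*" for p in probes):
            failed.append(hop)
    return failed

sample = """\
traceroute to 192.168.100.2 (192.168.100.2) from 192.168.100.1 (192.168.100.1), 30 hops max
outgoing MTU = 1500
1 * * *
2 * * *
"""
print(failed_hops(sample))
# prints: [1, 2]
```

A healthy line such as "1 dbrac2-priv (192.168.100.2) 1 ms 0 ms 0 ms" would not be flagged, because at least one probe was answered.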
OSW netstat logs:
OSW
---------------------
dbrac1_netstat_14.03.25.1300.dat
zzz Tue Mar 25 13:14:33 BEIST 2014
udp:
439252958 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
328052 dropped due to no socket
4 broadcast/multicast datagrams dropped due to no socket
7289 socket buffer overflows
438917613 delivered
434514260 datagrams output
...
zzz Tue Mar 25 13:15:08 BEIST 2014
udp:
439253377 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
328066 dropped due to no socket
4 broadcast/multicast datagrams dropped due to no socket
7289 socket buffer overflows
438918018 delivered
434516319 datagrams output
OSW
---------------------
dbrac2_netstat_14.03.25.1300.dat
zzz Tue Mar 25 13:14:33 BEIST 2014
udp:
182516839 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
303810 dropped due to no socket
0 broadcast/multicast datagrams dropped due to no socket
0 socket buffer overflows
182213029 delivered
188182343 datagrams output
...
zzz Tue Mar 25 13:15:02 BEIST 2014
udp:
182517239 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
303819 dropped due to no socket
0 broadcast/multicast datagrams dropped due to no socket
0 socket buffer overflows
182213420 delivered
188182641 datagrams output
Analysis: the UDP statistics show no socket buffer overflows, bad checksums, incomplete headers, etc.
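That conclusion comes from diffing the UDP error counters between the two snapshots. A small sketch of the check (a hypothetical helper, not an OSWatcher tool), fed with the node-1 figures from the 13:14:33 and 13:15:08 excerpts above:

```python
UDP_COUNTERS = [
    "incomplete headers",
    "bad data length fields",
    "bad checksums",
    "dropped due to no socket",
    "socket buffer overflows",
]

def parse_udp(snapshot: str) -> dict[str, int]:
    """Map counter name -> value; netstat lines look like '<count> <name>'."""
    stats = {}
    for line in snapshot.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1] in UDP_COUNTERS:
            stats[parts[1]] = int(parts[0])
    return stats

def error_deltas(before: str, after: str) -> dict[str, int]:
    """Counter growth between two snapshots of the same node."""
    b, a = parse_udp(before), parse_udp(after)
    return {name: a[name] - b[name] for name in a if name in b}

before = """\
0 bad checksums
328052 dropped due to no socket
7289 socket buffer overflows
"""
after = """\
0 bad checksums
328066 dropped due to no socket
7289 socket buffer overflows
"""
print(error_deltas(before, after))
# prints: {'bad checksums': 0, 'dropped due to no socket': 14, 'socket buffer overflows': 0}
```

Zero growth in "socket buffer overflows" and "bad checksums" during the problem window is what rules out a UDP buffer or corruption issue here.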
OSW vmstat logs:
OSW
-------------------------
dbrac1_vmstat_14.03.25.1300.dat
zzz ***Tue Mar 25 13:14:39 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 6927214 7214 0 0 0 0 0 0 60 3262132 9492 11 9 78 1
0 0 6926387 9432 0 0 0 1414 1413 0 72 3213613 9907 10 7 82 1
2 0 6926147 9652 0 0 0 0 0 0 56 3236071 7970 6 5 89 1
zzz ***Tue Mar 25 13:14:45 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 6922909 12885 0 0 0 0 0 0 94 3252893 10659 11 9 79 0
2 0 6922917 12856 0 0 0 0 0 0 50 3211865 8039 7 5 88 0
1 0 6922685 13069 0 0 0 0 0 0 49 3247836 7710 6 5 88 0
zzz ***Tue Mar 25 13:14:51 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 6922855 12894 0 0 0 0 0 0 126 3243290 9261 11 9 79 1
1 0 6922863 12865 0 0 0 0 0 0 34 3206410 7722 7 5 88 1
2 0 6924085 11623 0 0 0 0 0 0 45 3229507 7859 7 5 88 1
zzz ***Tue Mar 25 13:14:56 BEIST 2014
System configuration: lcpu=24 mem=49152MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 0 6922895 12779 0 0 0 0 0 0 79 3252256 8939 11 9 80 1
1 0 6922902 12750 0 0 0 0 0 0 44 3195964 7207 7 5 87 1
2 0 6922671 12929 0 0 0 0 0 0 80 3208503 7401 6 5 89 1
OSW
-------------------------
dbrac2_vmstat_14.03.25.1300.dat
zzz ***Tue Mar 25 13:14:39 BEIST 2014
System configuration: lcpu=24 mem=49151MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 4369629 7628806 0 0 0 0 0 0 57 3252089 7493 10 8 81 0
1 0 4369637 7628777 0 0 0 0 0 0 22 3201077 6327 6 4 89 1
1 0 4369406 7628989 0 0 0 0 0 0 32 3210307 7840 6 5 89 0
zzz ***Tue Mar 25 13:14:45 BEIST 2014
System configuration: lcpu=24 mem=49151MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 4369486 7628912 0 0 0 0 0 0 118 3252929 7799 11 9 79 1
1 1 4369410 7628967 0 0 0 0 0 0 304 3210820 9094 6 5 88 1
1 0 4369179 7629179 0 0 0 0 0 0 38 3215090 5315 6 4 89 1
zzz ***Tue Mar 25 13:14:51 BEIST 2014
System configuration: lcpu=24 mem=49151MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 4369636 7628720 0 0 0 0 0 0 55 3255957 9446 11 9 79 1
2 0 4369644 7628691 0 0 0 0 0 0 22 3219176 6673 6 4 89 1
1 0 4369413 7628903 0 0 0 0 0 0 27 3205112 5832 6 4 89 1
Analysis: the run queue (r) is low, the blocked-thread count (b) is low, CPU utilization is low, and there is no paging (pi/po).
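The same check, as a rough sketch: scan AIX vmstat data lines for a long run queue, blocked threads, paging, or CPU saturation. The thresholds here are illustrative assumptions, not Oracle-documented limits:

```python
def vmstat_flags(samples: list[str]) -> list[str]:
    """Check AIX vmstat data lines
    (columns: r b avm fre re pi po fr sr cy in sy cs us sy id wa)."""
    warnings = []
    for line in samples:
        f = line.split()
        r, b = int(f[0]), int(f[1])
        pi, po = int(f[5]), int(f[6])
        idle = int(f[15])
        if r > 12:          # run queue far longer than half the 24 lcpus
            warnings.append(f"run queue high: r={r}")
        if b > 2:           # sustained blocked threads
            warnings.append(f"blocked threads: b={b}")
        if pi or po:        # any paging in or out
            warnings.append(f"paging: pi={pi} po={po}")
        if idle < 10:       # CPU nearly saturated
            warnings.append(f"cpu saturated: id={idle}")
    return warnings

# One of the node-1 samples above: clean on every check.
print(vmstat_flags(["3 0 6927214 7214 0 0 0 0 0 0 60 3262132 9492 11 9 78 1"]))
# prints: []
```

Every sample in both nodes' logs above passes these checks, which is why OS resource pressure can be excluded as the cause.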
dbracdb1_lmon_6423022.trc:
2014-03-25 13:14:59.599871 : kjxgrcomerr: Communications reconfig: instance 2 (50,50)
*** 2014-03-25 13:14:59.687
2014-03-25 13:14:59.687302 : kjxgrrcfg: done - ret = 3 hist 0x1679a (initial rsn: 3)
kjxgrrcfgchk: Initiating reconfig, reason=3 <======================================reason 3: communication failure
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.<============================Recievers are healthy
2014-03-25 13:14:59.687425 : kjxgrnetchk: start 0x19b28646, end 0x19b33c90
2014-03-25 13:14:59.687444 : kjxgrnetchk: Network Validation wait: 46 sec
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x7e8aac60
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
*** 2014-03-25 13:14:59.687
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 50 0.
kjxgrs0h: disable CGS timeout
...
2014-03-25 13:15:42.511713 : kjxgrDiskVote: bitmap: 1
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 54 2.
kjfmSendAbortInstMsg: send an abort message to instance 2
dbracdb2_lmon_7995464.trc:
*** 2014-03-25 13:14:59.868
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2014-03-25 13:14:59.867489 : kjxgrnetchk: start 0xafd32909, end 0xafd3df53
2014-03-25 13:14:59.867509 : kjxgrnetchk: Network Validation wait: 46 sec
2014-03-25 13:14:59.867528 : kjxgrnetchk: Sending comm check req to inst 1
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
*** 2014-03-25 13:14:59.868
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 50 0.
kjxgrs0h: disable CGS timeout
...
2014-03-25 13:15:25.722114 : kjxgrdtrt: Evicted by inst 1, seq (52, 52)
IMR state information
Inst 2, thread 2, state 0x4:124c, flags 0x12ca9:0x0001
Analysis:
According to the LMON traces, instance 2 was evicted by instance 1 with reason 3: Communication Failure.
Reason 3 is documented as follows:
Troubleshooting ORA-29740 in a RAC Environment (Doc ID 219361.1)
Reason 3: Communications Failure. This can happen if:
a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of
another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
and is not responding to remote requests.
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
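The reason codes above lend themselves to a small lookup table (values taken from the Doc ID 219361.1 excerpt quoted above), so a trace line such as "Initiating reconfig, reason=3" can be decoded at a glance. `decode_reason` is a hypothetical helper, not an Oracle utility:

```python
import re

ORA_29740_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications Failure",
    4: "Reconfiguration after suspend",
}

def decode_reason(trace_line: str) -> str:
    """Pull 'reason=N' out of an LMON trace line and translate it."""
    m = re.search(r"reason=(\d+)", trace_line)
    return ORA_29740_REASONS.get(int(m.group(1)), "unknown") if m else "unknown"

print(decode_reason("kjxgrrcfgchk: Initiating reconfig, reason=3"))
# prints: Communications Failure
```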
Continuing with the trace files:
dbracdb1_lms0_6160700.trc:
2014-03-25 13:13:03.149035 : GSIPC:PING: rcv'd PINGREQ[4] from 2.1 (seq 0.10358654) stm 0x45af2ac2
2014-03-25 13:13:03.149109 : GSIPC:PING: send PINGACK[4] to 2.1 (seq 0.10360096)
*** 2014-03-25 13:14:54.428
IPC Send timeout detected. Sender: ospid 6160700 [Oracle@dbrac1 (LMS0)] <============LMS0 of instance1 is the sender of the IPC
Receiver: inst 2 binc 291585594 ospid 11010320 <========= LMS0 of instance2 is the receiver of the IPC
dumping request handle 11123d958
krqh = 11123d958
...
Trace Bucket Dump End: KSI
Reporting Communication error with instance 2
dbracdb2_lms0_11010320.trc:
*** 2014-03-25 13:15:25.242
===== Idle Connection Overview =====
Idle connections [0x198ea]: 1
IdleConn List: 1[r:0.10358654,t:0x198ea]
GSIPC:IKILL: ping to inst 1 start 104682 now 104824 icktm 140 psm 0
Reporting Communication error with instance 1
Analysis: both the sender and the receiver of the IPC reported communication errors with each other.
Summary of the analysis:
We can now confirm that instance 2 was evicted from the cluster by instance 1 because inter-instance communication timed out.
The sender of the timed-out IPC was LMS0 of instance 1 and the receiver was LMS0 of instance 2. Both processes were alive and well at the time, and LMON's own check reported "Recievers are healthy".
LMON and the LMS0 processes of both instances all reported Communication Failure/Error, which shows that the problem was caused by a communication failure.
The OSW data shows that neither node had a resource problem at the time, but the traceroute output contains "* * *", which points to a private-network communication problem.
Solution:
The problem was resolved by replacing the private-network fiber cable; follow-up verification confirmed that the database returned to normal.