IPC Send timeout detected: Simulation and Summary

       In the alert logs of a multi-node cluster you will often see "IPC Send timeout detected" errors. Sometimes they appear only occasionally and nothing looks abnormal at the database level, but if a large number of "IPC Send timeout detected" errors appear within a short period, an instance may end up being evicted.

Below is an alert-log excerpt where IPC Send timeouts led to an instance being evicted.

2019-07-09T04:00:56.552363+08:00
IPC Send timeout detected. Sender: ospid 330801 [oracle@BB001 (PING)]   <===== IPC Send timeout detected; the sender is this instance's PING process
Receiver: inst 4 binc 8 ospid 129583                                    <===== the receiver is process ospid 129583 on instance 4
2019-07-09T04:04:10.694155+08:00
SGSGXDB1PDB(3):minact-scn: got error during useg scan e:12751 usn:11
SGSGXDB1PDB(3):minact-scn: useg scan erroring out with error e:12751
2019-07-09T04:06:20.571349+08:00
IPC Send timeout detected. Sender: ospid 331082 [oracle@BB001 (LCK0)]
Receiver: inst 4 binc 8 ospid 129871
2019-07-09T04:06:20.648844+08:00
Communications reconfiguration: instance_number 4 by ospid 331082
2019-07-09T04:06:32.085167+08:00
LMS1 (ospid: 330817) has detected no messaging activity from instance 4
USER (ospid: 330817) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2019-07-09T04:06:32.180821+08:00
LMON (ospid: 330811) drops the IMR request from LMS1 (ospid: 330817) because IMR is in progress and inst 4 is marked bad.
2019-07-09T04:06:49.447783+08:00
Evicting instance 4 from cluster                               <====== instance 4 starts being evicted
Waiting for instances to leave: 4
2019-07-09T04:07:00.267690+08:00
Dumping diagnostic data in directory=[cdmp_20190709040056], requested by (instance=4, osid=129674 (MMON)), summary=[abnormal instance termination].
2019-07-09T04:07:08.853865+08:00
LMON received an instance eviction notification from instance 3
The instance eviction reason is 0x20000000
The instance eviction map is 4
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 4
2019-07-09T04:07:09.508223+08:00
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 4
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 4
2019-07-09T04:07:12.900020+08:00
Reconfiguration started (old inc 8, new inc 12)               <====== reconfiguration begins
List of instances (total 3) :
 1 2 3
Dead instances (total 1) :
 4
My inst 1
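
To spot this sequence quickly in an existing environment, a grep along the following lines can be used; the alert-log path is only illustrative and should be adjusted to the actual diag destination.

grep -nE 'IPC Send timeout|Evicting instance|Reconfiguration started' \
  $ORACLE_BASE/diag/rdbms/*/*/trace/alert_*.log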

Searching MOS for documents related to the IPC Send timeout detected error turns up the following:
IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)

Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)

Doc ID 2008933.1 describes the symptom as follows:

While this is happening, "netstat" shows huge jump of "packet reassembles failed": 

Checking the netstat output captured by OSWatcher indeed shows a large number of packet reassembles failed, as below:

[oracle@BB001 oswnetstat]$ grep -ni 'packet reassembles failed' BB001_netstat_19.07.09.0400.dat
0508971 packet reassembles failed       <====== large numbers of reassembly failures while the problem is occurring
0061284 packet reassembles failed
0595953 packet reassembles failed
0939239 packet reassembles failed
0942727 packet reassembles failed
0946213 packet reassembles failed
40949641 packet reassembles failed
40953041 packet reassembles failed
40956356 packet reassembles failed
40959776 packet reassembles failed
40963302 packet reassembles failed
40966844 packet reassembles failed
40970388 packet reassembles failed
40973952 packet reassembles failed
40977485 packet reassembles failed
40981039 packet reassembles failed
40984542 packet reassembles failed
40988124 packet reassembles failed
40991680 packet reassembles failed
40995193 packet reassembles failed
40998705 packet reassembles failed
41002205 packet reassembles failed
41005754 packet reassembles failed
41009364 packet reassembles failed
41012851 packet reassembles failed
41016349 packet reassembles failed
41019855 packet reassembles failed
41023392 packet reassembles failed
41026922 packet reassembles failed
41030468 packet reassembles failed
41033983 packet reassembles failed
41037427 packet reassembles failed
41040896 packet reassembles failed
41044446 packet reassembles failed
41047890 packet reassembles failed
41051359 packet reassembles failed
41054771 packet reassembles failed
41057762 packet reassembles failed
41061829 packet reassembles failed
41065435 packet reassembles failed
41069131 packet reassembles failed
41069818 packet reassembles failed
41069818 packet reassembles failed    
41069820 packet reassembles failed    <====== counter stops increasing: packet reassembly back to normal
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
41069820 packet reassembles failed
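
To make the jump easier to see, the per-sample increase of the counter can be printed instead of the raw values; a minimal sketch against the OSWatcher file used above:

grep -i 'packet reassembles failed' BB001_netstat_19.07.09.0400.dat | \
  awk '{ if (NR > 1) print $1 - prev " new reassembly failures in this sample"; prev = $1 }'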

      Conclusion

        Depending on the MTU (Maximum Transmission Unit) size, large UDP packets may be fragmented and sent in multiple frames. These fragments must be reassembled on the receiving node. High CPU usage (sustained or frequent spikes) and undersized reassembly or UDP buffers can also cause reassembly to fail. On the receiving node, the "IP Statistics" section of the 'netstat -s' output then shows large numbers of "reassembles failed" and "fragments dropped after timeout" for the Internet Protocol (IP). Fragments must be reassembled within a fixed time (time-to-live); fragments that cannot be reassembled in time are dropped and have to be retransmitted. Fragments that have arrived but cannot be reassembled because of insufficient buffer space are simply discarded.
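On a live system these counters can be read directly from 'netstat -s'; a minimal check, assuming a Linux receiving node:

netstat -s | grep -Ei 'reassembles failed|fragments dropped after timeout'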

1. Fix for packet reassembles failed: increase the reassembly buffer size so that more memory is available for reassembly. Method:
Add the following two lines to /etc/sysctl.conf:
net.ipv4.ipfrag_high_thresh = 16777216 (default = 262144)
net.ipv4.ipfrag_low_thresh = 15728640  (default = 196608)
// The reassembly timeout can also be increased via:
/proc/sys/net/ipv4/ipfrag_time (default = 30)

Apply the parameters:
sysctl -p 

Check that they took effect:
sysctl -a|grep ipfrag

Note:
/proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh
set the upper and lower memory thresholds used for reassembling IP fragments. Once the high threshold is reached, further fragments are discarded until memory use falls back to the low threshold.
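
If the reassembly timeout also needs to be raised, it can be changed at runtime and persisted in the same way as the thresholds; the value 60 below is only an illustrative example, not a recommendation from the MOS note.

echo 60 > /proc/sys/net/ipv4/ipfrag_time                  # takes effect immediately
echo 'net.ipv4.ipfrag_time = 60' >> /etc/sysctl.conf      # persist across reboots
sysctl -p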

2. Increase the MTU size to reduce packet fragmentation

Check the current MTU of each NIC:
ip -s link  | grep mtu | awk '{print $2 $5}'

To change it, set the MTU to 9000:
/sbin/ifconfig eth1 mtu 9000
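
Note that an ifconfig change is lost on reboot. On a RHEL-style system the MTU can also be persisted in the interface configuration file (eth1 is assumed to be the private interconnect, as above). In a RAC cluster the MTU must match on every private NIC and on the switch ports in between, otherwise fragmentation problems get worse.

echo 'MTU=9000' >> /etc/sysconfig/network-scripts/ifcfg-eth1
/sbin/ifconfig eth1 mtu 9000
ip link show eth1 | grep mtu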

 

//=========================================

Below is the procedure used to simulate the IPC Send timeout detected problem in a test environment.

Change the ipfrag values in sysctl.conf on both nodes (deliberately small, to provoke the problem):
[root@rac1 ~]# cat /etc/sysctl.conf | grep ipfrag
net.ipv4.ipfrag_high_thresh = 66670
net.ipv4.ipfrag_low_thresh = 66570

Change the private-network MTU on both nodes to 900:

[root@rac1 ~]# ip -s link  | grep mtu | awk '{print $2 $5}'
lo:16436
eth0:1500
eth1:900     <======= MTU is now 900
pan0:1500
[root@rac1 ~]# 
Note:

     eth1 is the private-interconnect NIC.

// Generate gc (global cache) traffic
-- Create test table 1
create table scott.gc_test1 tablespace users as select * from dba_segments where rownum < 100001;
-- Insert data
begin
for i in 1..100 loop
insert into  scott.gc_test1 select * from dba_segments where rownum < 100001;
commit;
end loop;
end;
/

-- Create test tables 2, 3 and 4
create table scott.gc_test2 tablespace users as select * from scott.gc_test1;
create table scott.gc_test3 tablespace users as select * from scott.gc_test1;
create table scott.gc_test4 tablespace users as select * from scott.gc_test1;
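
A quick sanity check of the test data, run as the oracle user; this query is not part of the original test, it only confirms the segments exist and shows how much space they take:

sqlplus -S / as sysdba << EOF
set pages 100 lines 120
col segment_name for a20
select segment_name, round(bytes/1024/1024) size_mb
  from dba_segments
 where owner = 'SCOTT' and segment_name like 'GC_TEST%';
exit;
EOF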

 

## Node 1: update1.sh script
#!/bin/sh
. ~/.profile

sqlplus -S / as sysdba << END
begin
for i in 1..10 loop
update scott.gc_test1 set HEADER_FILE = HEADER_FILE + 1 where HEADER_FILE < 100000000;
dbms_lock.sleep(30);
commit;
end loop;
end;
/
exit;
END

update2.sh, update3.sh and update4.sh are the same as the script above, with only the table name changed; a quick way to generate them is sketched below.
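
A small sketch for generating the three variants with sed instead of editing them by hand (assuming the scripts sit in the current directory; the same approach works for the select scripts on node 2):

for n in 2 3 4; do
  sed "s/gc_test1/gc_test${n}/g" update1.sh > update${n}.sh
  chmod +x update${n}.sh
done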

 

## Node 2: select scripts

[oracle@rac2]$cat select1.sh 
#!/bin/sh
. ~/.bash_profile

for((i=1;i<=10000000;i++));
do
sqlplus scott/tiger << EOF
select HEADER_FILE,count(*) count from scott.gc_test1 group by HEADER_FILE order by HEADER_FILE;
exit
EOF
done

select2.sh, select3.sh and select4.sh are likewise the same as above, with only the table name changed.

Run update1.sh, update2.sh, update3.sh and update4.sh on node 1, and select1.sh, select2.sh, select3.sh and select4.sh on node 2. gc waits appear soon after the scripts start, because the MTU and the reassembly buffers were deliberately made small. After running for a while, IPC Send timeouts show up (how long that takes is not deterministic; sometimes it will not appear even after a long run, which depends on the test scripts — the more gc traffic you can generate, the easier it is to trigger the timeout). While the scripts run, watch the wait events on both nodes: gc lost related wait events will be visible (one way to launch the scripts and check these waits is sketched below).
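A minimal sketch of how the scripts can be launched in parallel and how the lost-block waits can be checked; the gv$system_event query is a generic check added here, not part of the original test:

# Node 1:
nohup sh update1.sh & nohup sh update2.sh &
nohup sh update3.sh & nohup sh update4.sh &
# Node 2:
nohup sh select1.sh & nohup sh select2.sh &
nohup sh select3.sh & nohup sh select4.sh &

# On either node, watch the lost-block wait statistics:
sqlplus -S / as sysdba << 'EOF'
col event for a40
select inst_id, event, total_waits
  from gv$system_event
 where event like 'gc%lost'
 order by inst_id, event;
exit;
EOF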

Node 1 database alert log

Thu Nov 21 13:51:20 2019
IPC Send timeout detected. Receiver ospid 51038 [oracle@rac1.localdomain (LCK0)]
Thu Nov 21 13:51:20 2019
Errors in file /u01/app/oracle/diag/rdbms/devdb/devdb1/trace/devdb1_lck0_51038.trc:
Thu Nov 21 13:54:41 2019
IPC Send timeout detected. Receiver ospid 31091 [oracle@rac1.localdomain (PZ99)]
Thu Nov 21 13:54:41 2019
Errors in file /u01/app/oracle/diag/rdbms/devdb/devdb1/trace/devdb1_pz99_31091.trc:
Thu Nov 21 13:58:34 2019
Time drift detected. Please check VKTM trace file for more details.
Thu Nov 21 14:00:04 2019
IPC Send timeout detected. Receiver ospid 51038 [oracle@rac1.localdomain (LCK0)]
Thu Nov 21 14:00:04 2019
Errors in file /u01/app/oracle/diag/rdbms/devdb/devdb1/trace/devdb1_lck0_51038.trc:
Thu Nov 21 14:08:53 2019
IPC Send timeout detected. Receiver ospid 51038 [oracle@rac1.localdomain (LCK0)]
Thu Nov 21 14:08:53 2019
Errors in file /u01/app/oracle/diag/rdbms/devdb/devdb1/trace/devdb1_lck0_51038.trc:
Thu Nov 21 14:15:16 2019
IPC Send timeout detected. Receiver ospid 43909 [oracle@rac1.localdomain (PZ99)]
Thu Nov 21 14:15:16 2019
Errors in file /u01/app/oracle/diag/rdbms/devdb/devdb1/trace/devdb1_pz99_43909.trc:
Thu Nov 21 14:16:24 2019
minact-scn: useg scan erroring out with error e:12751
Thu Nov 21 14:16:36 2019
Reconfiguration started (old inc 4, new inc 6)
List of instances:
 1 (myinst: 1)

Node 2 database alert log

Thu Nov 21 13:51:19 2019
IPC Send timeout detected. Sender: ospid 8916 [oracle@rac2.localdomain (LCK0)]
Receiver: inst 1 binc 2 ospid 51038
Thu Nov 21 13:54:34 2019
IPC Send timeout detected. Sender: ospid 11348 [oracle@rac2.localdomain (M000)]
Receiver: inst 1 binc 2 ospid 31091
Thu Nov 21 14:00:04 2019
IPC Send timeout detected. Sender: ospid 8916 [oracle@rac2.localdomain (LCK0)]
Receiver: inst 1 binc 2 ospid 51038
Thu Nov 21 14:01:39 2019
Restarting dead background process DIA0
Thu Nov 21 14:01:39 2019
DIA0 started with pid=10, OS id=11689
Thu Nov 21 14:08:48 2019
IPC Send timeout detected. Sender: ospid 8916 [oracle@rac2.localdomain (LCK0)]
Receiver: inst 1 binc 2 ospid 51038
Thu Nov 21 14:13:00 2019
Restarting dead background process DIA0
Thu Nov 21 14:13:00 2019
DIA0 started with pid=10, OS id=11892
Thu Nov 21 14:15:11 2019
IPC Send timeout detected. Sender: ospid 11812 [oracle@rac2.localdomain (M000)]
Receiver: inst 1 binc 2 ospid 43909
Thu Nov 21 14:16:27 2019
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/devdb/devdb2/trace/devdb2_asmb_8898.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:

 
