Networking and Router Problems in Red Hat OpenShift Container Platform

Scenario:

        One server acts as the server side and communicates with the agents on 2,000 backend servers, much like an Oracle OMS server communicating with the agents on the hosts it monitors.

Problem Description:

        The server's network frequently becomes unreachable. When the problem occurs, SSH logins fail with "connection error to xx.xx.xx.xx:22" and then recover on their own after a while. While one client cannot SSH to the server, another client may connect normally at the same time. During these episodes, /var/log/messages on the server reports "net_ratelimit: 69 callbacks suppressed".
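A quick way to confirm the symptom is to search the kernel ring buffer and the syslog file for the rate-limit messages (a minimal check, assuming a standard rsyslog setup that writes kernel messages to /var/log/messages):

# Look for suppressed-callback messages in the kernel log and in syslog
dmesg -T | grep net_ratelimit
grep net_ratelimit /var/log/messages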

Root Cause Analysis:

The server's ARP cache has reached its configured limit, so ARP entries for new peers cannot be cached.

The ARP cache has hit its maximum size: arp -a | wc -l returns 1024, which equals the largest threshold reported by sysctl -a | grep net.ipv4.neigh.default.gc_thresh (gc_thresh3 = 1024). Once the cache is full, additional ARP entries cannot be stored and new connections to the server fail.
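For example, the cache size and the kernel thresholds can be compared directly on the affected server (illustrative commands; the exact numbers will differ per system):

# Count the entries currently held in the ARP/neighbor cache
arp -an | wc -l
# Show the kernel garbage-collection thresholds for the neighbor cache
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3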


Solution:

Add the following three lines to /etc/sysctl.conf:
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536
Save the file and run sysctl -p to apply the change.
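The change can be verified immediately, without a reboot (a minimal sketch):

# Reload /etc/sysctl.conf and print the values that were applied
sysctl -p
# Confirm the running kernel now uses the new hard limit
sysctl net.ipv4.neigh.default.gc_thresh3
# The cached entry count should now stay below gc_thresh3
ip neigh show | wc -l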
These kernel parameters have the following meanings:
gc_stale_time
How often the validity of neighbor cache entries is checked. When an entry goes stale, it is resolved again before data is sent to that neighbor. The default is 60 seconds.
gc_thresh1
The minimum number of entries to keep in the ARP cache. The garbage collector does not run if there are fewer entries than this. The default is 128.
gc_thresh2
The soft maximum number of entries to keep in the ARP cache. The garbage collector allows the entry count to exceed this value for 5 seconds before collecting. The default is 512.
gc_thresh3
The hard maximum number of entries to keep in the ARP cache. As soon as the entry count exceeds this value, the garbage collector runs immediately. The default is 1024.

Reference Material:

Environment
Red Hat OpenShift Container Platform 3.x
Issue
Our ha-proxies have a pretty high restart count (> 300 in a few days). As they do not log much, I can only provide the outputs below. Any ideas on how to debug this further? In the logs we see:
1201 08:42:28.395691  1 ratelimiter.go:50] error reloading router: wait: no child processes

Occasionally we see pods failing to connect to or resolve a service name within the OpenShift cluster. After a few seconds it starts working again, and we have no idea where this is coming from.
I'm having weird networking problems in my OpenShift cluster and I see the following messages reported on my nodes:
Dec 06 09:39:11 ose3-node1 kernel: net_ratelimit: 119 callbacks suppressed
Dec 06 09:39:17 ose3-node1 kernel: net_ratelimit: 154 callbacks suppressed

Resolution
In Red Hat OpenShift Container Platform clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which by default is 1024), it's necessary to increase the default values of sysctl variables to allow more entries in the ARP cache.
The following sysctl values are suggested for clusters with large numbers of routes:
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536

To make these settings permanent across reboots, create a custom tuned profile.
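A custom tuned profile might look like the sketch below. The profile name openshift-node-arp is hypothetical, and the include= parent is an assumption; use whatever profile tuned-adm active reports on the node:

# /etc/tuned/openshift-node-arp/tuned.conf  (hypothetical profile name)
[main]
summary=Raise neighbor (ARP) cache thresholds for OpenShift nodes
# Assumption: inherit from the node's currently active profile (check tuned-adm active)
include=openshift-node

[sysctl]
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536

Activate the profile with tuned-adm profile openshift-node-arp.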
Root Cause
Red Hat OpenShift Container Platform was overflowing the neighbor cache because the router wanted to talk to so many IP addresses on the local SDN network. The default neighbor (ARP) cache is 1024. So when it overflows and removes entries, it could remove an entry for a packet that was in the queue and the packet would get dropped.
Diagnostic Steps
Compare ARP table against kernel settings:
# ip neigh show | wc -l
1005
# sysctl -a | grep net.ipv4.neigh.default.gc_thresh
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

In the example above, the number of ARP neighbors is close to the threshold setting, so the thresholds would need to be raised in this case.
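If the cache size is suspected but not yet confirmed, the table can also be watched over time and the kernel log checked for overflow events (a rough sketch; the exact overflow message wording varies by kernel version):

# Sample the neighbor-table size every 10 seconds
watch -n 10 'ip neigh show | wc -l'
# Overflow events are also logged by the kernel
dmesg -T | grep -i 'table overflow'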
