Networking and Router Problems in Red Hat OpenShift Container Platform

Scenario:

        One server acts as the server side and communicates with the agents on 2,000 backend servers, much like an Oracle OMS server communicating with the agents on the hosts it monitors.

Problem Description:

        The server's network frequently becomes unreachable. When the problem occurs, SSH logins fail with "connection error to xx.xx.xx.xx:22" and then recover on their own after a while. While one client cannot SSH to the server, another client may connect normally at the same time. During these episodes, /var/log/messages on the server reports "net_ratelimit: 69 callbacks suppressed".
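A quick way to confirm the symptom is to search the kernel ring buffer and the syslog file for the rate-limit messages (a minimal check, assuming a standard rsyslog setup that writes kernel messages to /var/log/messages):

# Look for suppressed-callback messages in the kernel log and in syslog
dmesg -T | grep net_ratelimit
grep net_ratelimit /var/log/messages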

Root Cause Analysis:

The server's ARP cache has reached its configured limit, so ARP entries for new peers cannot be cached.

The ARP cache has hit its maximum size: arp -a | wc -l returns 1024, which equals the largest threshold reported by sysctl -a | grep net.ipv4.neigh.default.gc_thresh (gc_thresh3 = 1024). Once the cache is full, additional ARP entries cannot be stored and new connections to the server fail.
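For example, the cache size and the kernel thresholds can be compared directly on the affected server (illustrative commands; the exact numbers will differ per system):

# Count the entries currently held in the ARP/neighbor cache
arp -an | wc -l
# Show the kernel garbage-collection thresholds for the neighbor cache
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3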


Solution:

Add the following three lines to /etc/sysctl.conf:
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536
Save the file and run sysctl -p to apply the change.
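The change can be verified immediately, without a reboot (a minimal sketch):

# Reload /etc/sysctl.conf and print the values that were applied
sysctl -p
# Confirm the running kernel now uses the new hard limit
sysctl net.ipv4.neigh.default.gc_thresh3
# The cached entry count should now stay below gc_thresh3
ip neigh show | wc -l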
These kernel parameters have the following meanings:
gc_stale_time
How often the validity of neighbor cache entries is checked. When an entry goes stale, it is resolved again before data is sent to that neighbor. The default is 60 seconds.
gc_thresh1
The minimum number of entries to keep in the ARP cache. The garbage collector does not run if there are fewer entries than this. The default is 128.
gc_thresh2
The soft maximum number of entries to keep in the ARP cache. The garbage collector allows the entry count to exceed this value for 5 seconds before collecting. The default is 512.
gc_thresh3
The hard maximum number of entries to keep in the ARP cache. As soon as the entry count exceeds this value, the garbage collector runs immediately. The default is 1024.

Reference Material:

Environment
Red Hat OpenShift Container Platform 3.x
Issue
Our ha-proxies have a pretty high restart count (> 300 in a few days). As they do not log much, I can only provide the outputs below. Any ideas on how to debug this further? In the logs we see:
1201 08:42:28.395691  1 ratelimiter.go:50] error reloading router: wait: no child processes

Occasionally we see pods failing to connect to or resolve a service name within the OpenShift cluster. After a few seconds it starts working again, and we have no idea where this is coming from.
I'm having weird networking problems in my OpenShift cluster and I see the following messages reported on my nodes:
Dec 06 09:39:11 ose3-node1 kernel: net_ratelimit: 119 callbacks suppressed
Dec 06 09:39:17 ose3-node1 kernel: net_ratelimit: 154 callbacks suppressed

Resolution
In Red Hat OpenShift Container Platform clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which by default is 1024), it's necessary to increase the default values of sysctl variables to allow more entries in the ARP cache.
The following sysctl values are suggested for clusters with large numbers of routes:
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536

To make these settings permanent across reboots, create a custom tuned profile.
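A custom tuned profile might look like the sketch below. The profile name openshift-node-arp is hypothetical, and the include= parent is an assumption; use whatever profile tuned-adm active reports on the node:

# /etc/tuned/openshift-node-arp/tuned.conf  (hypothetical profile name)
[main]
summary=Raise neighbor (ARP) cache thresholds for OpenShift nodes
# Assumption: inherit from the node's currently active profile (check tuned-adm active)
include=openshift-node

[sysctl]
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536

Activate the profile with tuned-adm profile openshift-node-arp.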
Root Cause
Red Hat OpenShift Container Platform was overflowing the neighbor cache because the router wanted to talk to so many IP addresses on the local SDN network. The default neighbor (ARP) cache is 1024. So when it overflows and removes entries, it could remove an entry for a packet that was in the queue and the packet would get dropped.
Diagnostic Steps
Compare ARP table against kernel settings:
# ip neigh show | wc -l
1005
# sysctl -a | grep net.ipv4.neigh.default.gc_thresh
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

In the example above, the number of ARP neighbors is close to the threshold setting, so the thresholds would need to be raised in this case.
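If the cache size is suspected but not yet confirmed, the table can also be watched over time and the kernel log checked for overflow events (a rough sketch; the exact overflow message wording varies by kernel version):

# Sample the neighbor-table size every 10 seconds
watch -n 10 'ip neigh show | wc -l'
# Overflow events are also logged by the kernel
dmesg -T | grep -i 'table overflow'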
