Troubleshooting steps:
1. Check the status of the Cilium pods
$ kubectl get pod -A | grep cilium
kube-system cilium-7x9pg 1/1 Running 0 5h53m
kube-system cilium-operator-6cbdf6b84-2mphp 1/1 Running 0 5h53m
kube-system cilium-operator-6cbdf6b84-lmllq 1/1 Running 0 5h53m
kube-system cilium-rtm8f 1/1 Running 0 5h53m
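On a cluster with many nodes, scanning this list by eye gets tedious. The following is a minimal sketch, not part of the original procedure, that filters the same output for Cilium pods that are not Running or not fully ready. The function name `unhealthy_cilium_pods` is hypothetical, and it assumes the default `kubectl get pod -A` column order (NAMESPACE NAME READY STATUS ...).

```shell
#!/bin/sh
# Reads `kubectl get pod -A` output on stdin and prints every Cilium pod
# that is not Running or whose READY count is incomplete (e.g. 0/1).
# Returns non-zero if at least one such pod is found.
# Assumption: default column order NAMESPACE NAME READY STATUS ...
unhealthy_cilium_pods() {
    awk '
        $2 ~ /^cilium/ {
            split($3, ready, "/")
            if ($4 != "Running" || ready[1] != ready[2]) {
                print $1 "/" $2 ": " $3 " " $4
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pod -A | unhealthy_cilium_pods && echo "all Cilium pods healthy"
```

If the function prints anything, inspect that pod with `kubectl describe pod` and `kubectl logs` before moving on to the next step.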
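2. Run cilium status inside a Cilium pod to check the state of each node. Healthy output looks like the following; when there is a problem, nodes are reported as unreachable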
$ kubectl exec -it cilium-7x9pg -n kube-system -- cilium status
KVStore: Ok Disabled
Kubernetes: Ok 1.20 (v1.20.6) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 10.0.2.161 (Direct Routing)]
Cilium: Ok 1.10.5 (v1.10.5-b0836e8)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 4/254 allocated from 241.255.0.0/24,
BandwidthManager: Disabled
Host Routing: Legacy
Masquerading: BPF [eth0] 241.255.0.0/24 [IPv4: Enabled, IPv6: Disabled]
Controller Status: 27/27 healthy
Proxy Status: OK, ip 241.255.0.76, 0 redirects active on ports 10000-20000
Hubble: Ok Current/Max Flows: 4095/4095 (100.00%), Flows/s: 4.95 Metrics: Disabled
Encryption: Disabled
Cluster health: 2/2 reachable (2022-09-22T14:57:12Z)
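The line that matters most for cross-node connectivity is the last one, `Cluster health`. As a rough sketch (the function name `cluster_health_ok` is hypothetical, and it assumes the line keeps the `N/M reachable` shape shown above), the check can be automated like this:

```shell
#!/bin/sh
# Reads `cilium status` output on stdin and succeeds only when the
# "Cluster health" line reports every node as reachable (e.g. 2/2).
# Fails if the line is missing or if fewer nodes are reachable than exist.
cluster_health_ok() {
    awk '
        /^Cluster health:/ {
            found = 1
            split($3, n, "/")
            printf "reachable nodes: %s of %s\n", n[1], n[2]
            if (n[1] != n[2]) bad = 1
        }
        END { exit ((found && !bad) ? 0 : 1) }
    '
}

# Usage (pod name taken from the example above):
#   kubectl exec cilium-7x9pg -n kube-system -- cilium status | cluster_health_ok \
#       || echo "some nodes are unreachable"
```

Running it against the Cilium pod on each node in turn shows whether every node sees the same picture.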
3. Check whether pods on different hosts can communicate
busybox.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: busybox
spec:
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        image: busybox   # bundles most common Linux commands; mainly used for testing
        args:
        - /bin/sh
        - -c
        - sleep 10; touch /tmp/healthy; sleep 30000
        readinessProbe:   # readiness probe
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 10   # first probe runs after 10s
$ kubectl apply -f busybox.yaml
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox-579rz 1/1 Running 0 106s 241.255.1.236 node-1 <none> <none>
busybox-5jpr5 1/1 Running 0 106s 241.255.0.81 master <none> <none>
# Ping the other pod's IP from one pod. Normally the ping succeeds; if it does not, cross-host pod networking has a problem
$ kubectl exec -it busybox-579rz -- ping 241.255.0.81
PING 241.255.0.81 (241.255.0.81): 56 data bytes
64 bytes from 241.255.0.81: seq=0 ttl=63 time=1.156 ms
64 bytes from 241.255.0.81: seq=1 ttl=63 time=0.914 ms
64 bytes from 241.255.0.81: seq=2 ttl=63 time=0.949 ms
64 bytes from 241.255.0.81: seq=3 ttl=63 time=0.854 ms
64 bytes from 241.255.0.81: seq=4 ttl=63 time=0.988 ms
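With more than two nodes, pinging every pair by hand is error-prone. The sketch below is an illustration only: it turns the `kubectl get pod -o wide` listing into one ping command per ordered pair of pods on different nodes. The function name `cross_node_ping_cmds` is hypothetical, and it assumes the column layout shown above (NAME READY STATUS RESTARTS AGE IP NODE), with no spaces inside the RESTARTS column.

```shell
#!/bin/sh
# Reads `kubectl get pod -o wide` output on stdin (including the header
# line) and prints one ping command for every ordered pair of Running
# pods that are scheduled on different nodes.
# Assumption: columns are NAME READY STATUS RESTARTS AGE IP NODE ...
cross_node_ping_cmds() {
    awk '
        NR > 1 && $3 == "Running" { name[++n] = $1; ip[n] = $6; node[n] = $7 }
        END {
            for (i = 1; i <= n; i++)
                for (j = 1; j <= n; j++)
                    if (node[i] != node[j])
                        printf "kubectl exec %s -- ping -c 3 -W 2 %s\n", name[i], ip[j]
        }
    '
}

# Usage: print the commands, review them, then pipe to sh to run them.
#   kubectl get pod -o wide -l app=busybox | cross_node_ping_cmds
#   kubectl get pod -o wide -l app=busybox | cross_node_ping_cmds | sh
```

The `-c 3 -W 2` options limit each ping to three packets with a two-second wait, so an unreachable pair does not hang the loop.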
4. Tear down the cluster, reinstall it with the Calico network plugin, then repeat the tests above
$ sealos clean --all
# If neither --network cilium nor --without-cni is specified, the Calico network plugin is installed by default
$ sealos init --passwd '123456' --master 10.21.2.161:22 --user root --node 10.21.2.162:22 --vip 10.21.2.165 --podcidr 241.255.0.0/16 --svccidr 241.254.0.0/16 --cert-sans sudytech.com --kubeadm-config ./kubeadm-config.yaml.tmpl --pkg-url /root/kube1.20.6.tar.gz --version v1.20.6
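Things to check:
1. Are the servers cloud servers? If so, the cloud provider's protection policies need to be turned off.
2. Is any protocol being blocked between the servers? Both TCP and UDP traffic must be allowed. Cilium uses UDP port 8472 for its VXLAN tunnel endpoint (VTEP).
3. Is there any protection between the servers (for example a WAF or network-level filtering)? If so, all of it needs to be turned off.
4. Does the school's network already run VXLAN? If so, podSubnet: "241.255.0.0/24" needs to be added to the route table of the VPC.
Reference: https://blog.csdn.net/qq_14962891/article/details/117223573
Alternatively, ask the virtual machine provider to change the VXLAN port on their side from 8472 to 4789.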
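Before opening a ticket with the provider, it helps to confirm which UDP port the node's own VXLAN device actually uses. The following is a rough sketch under two assumptions: Cilium runs in VXLAN tunnel mode (where it typically creates a device named `cilium_vxlan`), and the installed iproute2 prints a `dstport` field in `ip -d link show` output; the exact format varies between versions. The function name `vxlan_ports` is hypothetical.

```shell
#!/bin/sh
# Reads `ip -d link show` output on stdin and prints "<device> <port>"
# for every interface whose details contain a VXLAN "dstport" field.
# Assumption: iproute2 prints "dstport <N>" on the vxlan detail line.
vxlan_ports() {
    awk '
        # Interface header lines look like "5: cilium_vxlan: <FLAGS> ..."
        /^[0-9]+: / { dev = $2; sub(/:$/, "", dev) }
        {
            for (i = 1; i < NF; i++)
                if ($i == "dstport") print dev, $(i + 1)
        }
    '
}

# Usage (run on each node):
#   ip -d link show | vxlan_ports
```

To see whether the tunnel packets survive the path between nodes, run `tcpdump -ni eth0 udp port 8472` on both nodes while repeating the ping test (replace `eth0` with the node's real interface). If packets leave one node but never arrive on the other, something between the servers is dropping them.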