Symptom:
The coredns pod on node k8s-node1 is stuck in 0/1 Running.
Check the details with: kubectl describe pod coredns-57d4cbf879-xgk2f -n kube-system
[root@k8s-master kubernetes]# kubectl describe pod coredns-57d4cbf879-xgk2f -n kube-system
Name: coredns-57d4cbf879-xgk2f
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-node1.hikvision.com/192.168.1.172
Start Time: Thu, 18 Nov 2021 15:04:33 +0800
Labels: k8s-app=kube-dns
pod-template-hash=57d4cbf879
Annotations: cni.projectcalico.org/podIP: 192.168.62.1/32
Status: Running
IP: 192.168.62.1
IPs:
IP: 192.168.62.1
Controlled By: ReplicaSet/coredns-57d4cbf879
Containers:
coredns:
Container ID: docker://ed8fec719b0811be6b761e0acdc15ae9492f4366f7947ca81502e959e0e51efc
Image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Image ID: docker://sha256:296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b899
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Thu, 18 Nov 2021 15:12:38 +0800
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 18 Nov 2021 15:04:38 +0800
Finished: Thu, 18 Nov 2021 15:12:20 +0800
Ready: False
Restart Count: 1
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7p8r6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
kube-api-access-7p8r6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 45m default-scheduler Successfully assigned kube-system/coredns-57d4cbf879-xgk2f to k8s-node1.hikvision.com
Normal Pulled 45m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0" already present on machine
Normal Created 45m kubelet Created container coredns
Normal Started 45m kubelet Started container coredns
Warning Unhealthy 3s (x271 over 45m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
[root@k8s-master kubernetes]#
The key message is: Readiness probe failed: HTTP probe failed with statuscode: 503. On its own this tells me very little; all I know so far is that the CoreDNS pod running on the worker node is unhealthy, nothing more.
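One quick sanity check at this point is to hit the pod's readiness endpoint directly from the worker node. This is only a sketch; the pod IP (192.168.62.1) and readiness port (8181) are taken from the describe output above:

# Run on k8s-node1; the CoreDNS "ready" plugin listens on :8181
curl -v http://192.168.62.1:8181/ready
# Anything other than HTTP 200 here reproduces the kubelet probe failure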
Check the pod logs: kubectl logs -f coredns-57d4cbf879-xgk2f -n kube-system
[root@k8s-master kubernetes]# kubectl logs -f coredns-57d4cbf879-xgk2f -n kube-system
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
I1118 07:13:08.322248 1 trace.go:205] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (18-Nov-2021 07:12:38.320) (total time: 30001ms):
Trace[939984059]: [30.001197447s] [30.001197447s] END
E1118 07:13:08.322337 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I1118 07:13:08.322322 1 trace.go:205] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (18-Nov-2021 07:12:38.320) (total time: 30001ms):
Trace[1427131847]: [30.001190253s] [30.001190253s] END
I1118 07:13:08.322380 1 trace.go:205] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (18-Nov-2021 07:12:38.320) (total time: 30001ms):
Trace[911902081]: [30.001362202s] [30.001362202s] END
E1118 07:13:08.322407 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
E1118 07:13:08.322454 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
From this output it is clear that CoreDNS on the worker node cannot reach 10.96.0.1:443: dial tcp 10.96.0.1:443: i/o timeout.
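10.96.0.1 is the ClusterIP of the built-in "kubernetes" Service, i.e. the in-cluster address of the API server, so this is really a pod-to-apiserver connectivity problem. A rough way to confirm it (a sketch, not taken from the original session; behaviour from the node's host network can differ from inside a pod):

# On the master: confirm 10.96.0.1 really is the kubernetes Service ClusterIP
kubectl get svc kubernetes -n default
# From the affected worker node: the same endpoint CoreDNS is calling.
# An immediate TLS error or 403 means the route works; hanging until the
# timeout matches the error in the CoreDNS logs.
curl -k --connect-timeout 5 https://10.96.0.1:443/version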
Further googling turned up an issue thread where someone had hit exactly the same problem and already solved it. What a relief!
Original thread: https://github.com/coredns/coredns/issues/2325
The author also wrote the fix up as a blog post: https://medium.com/@cminion/quicknote-kubernetes-networking-issues-78f1e0d06e12
It can be summarized roughly as follows:
Symptoms
Pods on the worker nodes cannot connect to the API server;
connections to 10.96.0.1 time out;
but pods on the master (presumably untainted) work fine.
Solution
When you run kubeadm init, specify --pod-network-cidr, and make sure the IPs of your host/primary network are not inside the subnet you pass in.
That is, if your network runs on 192.168.*.*, use 10.0.0.0/16;
if your network runs on 10.0.*.*, use 192.168.0.0/16.
I had forgotten how subnetting works, so I did not realize that 192.168.0.1 is in the same subnet as 192.168.100.0/16. In short, a /16 mask here means anything in 192.168.*.* is included.
Since my network runs on 192.168.1.*, the master happened to work fine (its pods landed on 192.168.0.*), but my workers could not communicate because their pods tried to use 192.168.1.*, which neatly produced a routing problem on my boxes.
In other words, when initializing the master with "kubeadm init", the CIDR you assign to the Calico network plugin must not overlap with the IP addresses of your own environment (master and worker nodes). This is a remarkably subtle problem, and one that beginners can easily overlook.
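To check whether your own cluster has the same overlap, compare the node addresses with the pod subnet kubeadm was initialized with. A minimal sketch, assuming a kubeadm-built cluster (which stores this in the kubeadm-config ConfigMap):

# Node IPs -- here 192.168.1.x, i.e. inside 192.168.0.0/16
ip -4 addr show
# Pod subnet recorded at "kubeadm init" time
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -i podSubnet
# If the node IPs fall inside podSubnet, the ranges overlap and traffic to the
# API server gets routed into the Calico pod network instead of the real LAN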
Solution:
This is exactly what that note was getting at.
The only way forward is to rebuild the cluster: run "kubeadm reset" on both the master and the worker nodes, then run a fresh "kubeadm init" on the master, e.g. "sudo kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.13.1 --pod-network-cidr=10.0.0.0/16"; from there the steps are the same as in the first tutorial linked in this post.
The pod subnet can also be configured in kubeadm.yaml:
vim /usr/local/docker/kubernetes/kubeadm.yaml
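For reference, the pod subnet lives under networking.podSubnet. The excerpt below is only a sketch; the apiVersion and the other values are assumptions for a v1.21 kubeadm setup and must match your actual file:

# /usr/local/docker/kubernetes/kubeadm.yaml (excerpt)
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.21.0
imageRepository: registry.aliyuncs.com/google_containers
networking:
  serviceSubnet: 10.96.0.0/12
  # must NOT overlap with the host LAN (192.168.1.0/24 here)
  podSubnet: 10.0.0.0/16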
Then run kubeadm reset to wipe the old state;
after that, follow steps 4.3 through 5.1 of the CSDN blog "centos7.6安装Kubernetes-V1.21" by mayi_xiaochaun
to re-initialize kubeadm (roughly the sequence sketched below).
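Condensed, the rebuild boils down to the following sequence (a sketch only; the join token and hash are placeholders that come from the real "kubeadm init" output):

# On the master and on every worker node: wipe the old cluster state
kubeadm reset -f
# On the master: re-initialize with the non-overlapping pod subnet
kubeadm init --config /usr/local/docker/kubernetes/kubeadm.yaml
# On each worker node: rejoin with the command printed by "kubeadm init"
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>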
After re-adding the nodes, everything shows healthy when checked from the master: