Problem scenario:
1. Trying to access the application at http://dashboard.od.com fails, or it loads on one refresh and returns "bad gateway" on the next.
2. The pod behind dashboard.od.com looks fine: it is Running.
3. dig -t A dashboard.od.com @192.168.0.2 +short fails: no servers could be reached.
(dashboard.od.com is the application domain being accessed; 192.168.0.2 is the cluster DNS service IP.)
[root@hdss7-22 ~]# dig -t A dashboard.od.com @192.168.0.2 +short
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.9 <<>> -t A dashboard.od.com @192.168.0.2 +short
;; global options: +cmd
;; connection timed out; no servers could be reached
[root@hdss7-22 ~]# dig -t A dashboard.od.com @192.168.0.2 +short
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.9 <<>> -t A dashboard.od.com @192.168.0.2 +short
;; global options: +cmd
;; connection timed out; no servers could be reached
4. Following the earlier notes, first rule out the default FORWARD rule:
1) It may be the default iptables REJECT rule in the FORWARD chain:
[root@hdss7-22 ~]# iptables-save |grep reject
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-22 ~]# iptables -t filter -D FORWARD -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-22 ~]# dig -t A dashboard.od.com @192.168.0.2 +short
10.4.7.10
If that still does not help, try deleting the other REJECT rule (in the INPUT chain) as well:
[root@hdss7-22 ~]# iptables-save |grep reject
-A INPUT -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-22 ~]# iptables -t filter -D INPUT -j REJECT --reject-with icmp-host-prohibited
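The effect of the two deletions can be double-checked against `iptables-save` output. A minimal sketch (it filters a captured dump so it runs anywhere; on the node you would pipe in `iptables-save` itself):

```shell
# Count REJECT rules in an iptables-save dump; 0 means both the INPUT and
# FORWARD REJECT rules are gone. grep exits non-zero on zero matches, so
# `|| true` keeps the count usable under `set -e`.
count_reject_rules() {
  grep -c -- '-j REJECT' || true
}

# Example: a dump that still carries the INPUT REJECT rule.
sample='-A INPUT -j REJECT --reject-with icmp-host-prohibited'
printf '%s\n' "$sample" | count_reject_rules   # prints 1
```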
5. If dig -t A dashboard.od.com @192.168.0.2 +short still fails:
2) coredns may be down; restart the coredns pod.
3) If restarting the coredns pod throws errors, check whether the nodes can reach each other by short name (ping hdss7-200). If the short name fails but the FQDN works, the cause is /etc/resolv.conf; if the FQDN (ping hdss7-200.host.com) fails too, the cause is the named service on 10.4.7.11.
4) If it is a problem with named on 10.4.7.11, restarting named does not fix it, the nodes cannot reach it by FQDN (ping hdss7-200.host.com), yet 10.4.7.11 itself can ping every node, then it is an iptables problem on 10.4.7.11. Check whether the iptables service was installed and started: systemctl status iptables; if it is running, systemctl stop iptables; systemctl disable iptables.
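The short-name vs FQDN test in step 3) works because short names depend on the search line in /etc/resolv.conf, while FQDNs bypass the search list entirely. A minimal sketch of that check (reading resolv.conf content from stdin so the example is self-contained; on a node you would feed it the real /etc/resolv.conf):

```shell
# Short names like hdss7-200 only resolve when resolv.conf appends a search
# domain (host.com in this cluster).
has_search_domain() {
  grep -q '^search[[:space:]].*host\.com'
}

sample_resolv='nameserver 10.4.7.11
search host.com'
if printf '%s\n' "$sample_resolv" | has_search_domain; then
  echo "search domain present: short names should resolve"
else
  echo "search domain missing: only FQDNs will resolve"
fi
```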
5) If dig -t A dashboard.od.com @192.168.0.2 +short now works but dashboard.od.com still cannot be reached, restart the coredns pod.
6) If it still fails, reboot 10.4.7.11, re-apply the iptables rules, and reboot the node:
[root@hdss7-21 ~]# iptables -t nat -D POSTROUTING -s 172.7.21.0/24 ! -o docker0 -j MASQUERADE
[root@hdss7-21 ~]# iptables -t nat -I POSTROUTING -s 172.7.21.0/24 ! -d 172.7.0.0/16 ! -o docker0 -j MASQUERADE
[root@hdss7-21 ~]# iptables -t filter -D INPUT -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-21 ~]# iptables -t filter -D FORWARD -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-21 ~]# iptables-save >/etc/sysconfig/iptables
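The POSTROUTING change above is worth a note: the re-inserted rule adds `! -d 172.7.0.0/16`, so traffic between pods keeps its real source IP and only traffic leaving the overlay network gets masqueraded. A sketch that verifies the exclusion is present in a saved ruleset (shown against an inline dump; on the node, pipe in `iptables-save -t nat` instead):

```shell
# Pod-to-pod traffic (destination 172.7.0.0/16) must not be SNATed, or the
# receiving pod sees the node IP instead of the peer pod's IP.
dump='-A POSTROUTING -s 172.7.21.0/24 ! -d 172.7.0.0/16 ! -o docker0 -j MASQUERADE'
if printf '%s\n' "$dump" | grep -q -- '! -d 172.7.0.0/16'; then
  echo "optimized SNAT rule in place"
fi
```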
6. Still not working. Restarting the coredns pod now hits a new problem: the new coredns container will not come up.
[root@hdss7-21 ~]# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6d976bcb65-dzlct 1/1 Terminating 0 20m
coredns-6d976bcb65-mz9lp 0/1 Pending 0 63s
kubernetes-dashboard-7977cc79db-6p4g2 1/1 Running 3 2d15h
traefik-ingress-qp8k7 1/1 Running 4 2d23h
traefik-ingress-tqtsr 1/1 Running 4 2d23h
7. Force-delete the coredns pod; the replacement stays Pending, and recreating it gives the same result:
[root@hdss7-21 ~]# kubectl delete pod coredns-6d976bcb65-dzlct -n kube-system --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "coredns-6d976bcb65-dzlct" force deleted
[root@hdss7-21 ~]# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6d976bcb65-mz9lp 0/1 Pending 0 106s
kubernetes-dashboard-7977cc79db-6p4g2 1/1 Running 3 2d15h
traefik-ingress-qp8k7 1/1 Running 4 2d23h
traefik-ingress-tqtsr 1/1 Running 4 2d23h
8. Check the coredns pod logs; they reveal nothing useful:
[root@hdss7-21 ~]# kubectl logs coredns-6d976bcb65-dzlct -n kube-system
.:53
2022-03-23T00:50:00.370Z [INFO] plugin/reload: Running configuration MD5 = a6f3121e89bcc0078758b78198d1ddde
2022-03-23T00:50:00.370Z [INFO] CoreDNS-1.6.1
2022-03-23T00:50:00.370Z [INFO] linux/amd64, go1.12.7, 1fda570
CoreDNS-1.6.1
linux/amd64, go1.12.7, 1fda570
2022-03-23T00:50:01.490Z [INFO] 127.0.0.1:53888 - 9565 "HINFO IN 3797648408549217670.517492447913026483. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.119263588s
9. kubectl describe on the pod says the nodes have taints the pod does not tolerate:
[root@hdss7-21 ~]# kubectl describe pod coredns-6d976bcb65-dzlct -n kube-system
Warning FailedScheduling 57s default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate
10. And the node really does carry a taint:
[root@hdss7-21 ~]# kubectl describe node hdss7-21.host.com |grep -i taints
Taints: node-role.kubernetes.io/master=master:NoSchedule
11. Try to remove the taint, but it cannot be found:
[root@hdss7-21 ~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master-
error: taint "node-role.kubernetes.io/master:" not found
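For reference, `kubectl taint node <node> <key>-` removes a taint by key alone, while appending `-` to the full `<key>=<value>:<effect>` removes exactly the taint that `describe node` printed. A small sketch that builds the exact removal command from the Taints line captured above (string handling only, so it runs anywhere; the kubectl command it prints would still need a live apiserver):

```shell
# Build the exact untaint argument from the `Taints:` line of
# `kubectl describe node`; appending '-' to key=value:effect removes that taint.
taints_line='Taints: node-role.kubernetes.io/master=master:NoSchedule'
spec=$(printf '%s\n' "$taints_line" | awk '{print $2}')
echo "kubectl taint node hdss7-21.host.com ${spec}-"
```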
12. So check the node status: both nodes have gone NotReady. That explains the taint message — the pod cannot be scheduled because no node is ready.
[root@hdss7-21 ~]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
hdss7-21.host.com NotReady master,node 9d v1.15.2 10.4.7.21 <none> CentOS Linux 7 (Core) 3.10.0-1160.el7.x86_64 docker://20.10.13
hdss7-22.host.com NotReady master,node 9d v1.15.2 10.4.7.22 <none> CentOS Linux 7 (Core) 3.10.0-1160.el7.x86_64 docker://20.10.13
13. At this point, check whether the kubelet and kube-proxy components are actually running:
[root@hdss7-21 ~]# supervisorctl status
etcd-server-7-21 RUNNING pid 42597, uptime 0:32:19
flanneld-7-21 RUNNING pid 43151, uptime 0:30:58
kube-apiserver-7-21 RUNNING pid 42595, uptime 0:32:19
kube-controller-manager-7-21 RUNNING pid 42596, uptime 0:32:19
kube-kubelet-7-21 RUNNING pid 44067, uptime 0:28:26
kube-proxy-7-21 RUNNING pid 44567, uptime 0:27:36
kube-scheduler-7-21 RUNNING pid 42594, uptime 0:32:19
[root@hdss7-21 ~]#
14. Check the cluster component status; etcd looks healthy:
[root@hdss7-21 ~]# kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-1 Healthy {"health": "true"}
etcd-2 Healthy {"health": "true"}
etcd-0 Healthy {"health": "true"}
[root@hdss7-21 ~]#
15. Restarting the kubelet and kube-proxy components works normally, with no errors.
16. Restart the docker service.
17. Next, read the kubelet logs. A node's health is tied to its kubelet: only a machine running a kubelet can have pods scheduled onto it. The kubelet reports its state and receives pod specs by talking to the k8s apiserver, with etcd as the backing store: the apiserver writes resources into etcd and reads them back. etcd is healthy, so the kubelet is the next suspect.
[root@hdss7-21 ~]# tail -100f /data/logs/kubernetes/kube-kubelet/kubelet.stdout.log
E0323 10:21:30.007404 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.107861 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.209759 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.310686 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.411907 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.512347 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.537973 44072 kubelet_node_status.go:94] Unable to register node "hdss7-21.host.com" with API server: Post https://10.4.7.10:7443/api/v1/nodes: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538122 44072 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://10.4.7.10:7443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538239 44072 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://10.4.7.10:7443/api/v1/nodes?fieldSelector=metadata.name%3Dhdss7-21.host.com&limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538420 44072 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.4.7.10:7443/api/v1/pods?fieldSelector=spec.nodeName%3Dhdss7-21.host.com&limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538564 44072 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://10.4.7.10:7443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538657 44072 controller.go:115] failed to ensure node lease exists, will retry in 7s, error: Get https://10.4.7.10:7443/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/hdss7-21.host.com?timeout=10s: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.539465 44072 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://10.4.7.10:7443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.631109 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.731590 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.832935 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.934292 44072 kubelet.go:2248] node "hdss7-21.host.com" not found
The log says node "hdss7-21.host.com" not found, yet ping hdss7-21.host.com works fine.
It also says https://10.4.7.10:7443/api/v1/nodes: dial tcp 10.4.7.10:7443: connect: no route to host, and indeed ping 10.4.7.10 fails. Problem found: the apiserver VIP is unreachable. Restart keepalived, and the issue is resolved.
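Since 10.4.7.10 is the keepalived VIP in front of the apiserver (port 7443), "no route to host" usually means the VIP is no longer bound on either LB node. A sketch of that check (run against an inline `ip -4 addr` dump here so it is self-contained; on the LB nodes, pipe in `ip -4 addr show` instead):

```shell
# If the VIP is missing from every LB node's interfaces, kubelet cannot reach
# the apiserver and nodes flip to NotReady; restarting keepalived re-binds it.
vip='10.4.7.10'
addr_dump='    inet 10.4.7.11/24 brd 10.4.7.255 scope global eth0
    inet 10.4.7.10/32 scope global eth0'
if printf '%s\n' "$addr_dump" | grep -q "inet ${vip}/"; then
  echo "VIP ${vip} is bound"
else
  echo "VIP ${vip} missing: systemctl restart keepalived"
fi
```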
Solution:
When a node is NotReady, check whether the kubelet, kube-proxy, and etcd components are running normally. Restart the kubelet and kube-proxy components; if that does not help, restart the docker service; if that still does not help, check the etcd cluster health; if etcd is healthy, read the kubelet logs.