Node NotReady

Problem scenario:

1. The business site http://dashboard.od.com cannot be accessed, or it loads on one refresh and returns "bad gateway" on the next.

2. The pod behind dashboard.od.com looks fine and is Running.

3. Running dig -t A dashboard.od.com @192.168.0.2 +short fails; the DNS server cannot be reached:

(dashboard.od.com is the business domain being accessed; 192.168.0.2 is the cluster DNS service IP.)

[root@hdss7-22 ~]# dig -t A dashboard.od.com @192.168.0.2 +short
 
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.9 <<>> -t A dashboard.od.com @192.168.0.2 +short
;; global options: +cmd
;; connection timed out; no servers could be reached
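
Before going further, it is worth confirming that 192.168.0.2 really is the ClusterIP of the cluster DNS service. A quick sanity check; service names vary by deployment, so simply list everything in kube-system:

[root@hdss7-21 ~]# kubectl get svc -n kube-system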

4. Following the runbook written up earlier, first rule out the default FORWARD rule as the cause:

1. It may be caused by the default iptables REJECT rule on the FORWARD chain:
[root@hdss7-22 ~]# iptables-save |grep reject
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
 
[root@hdss7-22 ~]# iptables -t filter -D FORWARD -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-22 ~]# dig -t A dashboard.od.com @192.168.0.2 +short
10.4.7.10
 
If that still doesn't help, try deleting the matching rule from the INPUT chain as well:
[root@hdss7-22 ~]# iptables-save |grep reject
-A INPUT -j REJECT --reject-with icmp-host-prohibited
 
[root@hdss7-22 ~]# iptables -t filter -D INPUT -j REJECT --reject-with icmp-host-prohibited
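
After deleting the rules, verify they are gone, and save the ruleset if the change should survive an iptables service restart (the same mechanism used again in step 5 below):

[root@hdss7-22 ~]# iptables-save | grep reject              # should now print nothing
[root@hdss7-22 ~]# iptables-save > /etc/sysconfig/iptables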

5. If dig -t A dashboard.od.com @192.168.0.2 +short still fails:

2. CoreDNS itself may be down; restart the CoreDNS pod (a restart sketch follows this list).
 
3. If restarting the CoreDNS pod throws errors, check whether the nodes can reach each other by short hostname (ping hdss7-200). If the short name fails but the FQDN works, the culprit is /etc/resolv.conf; if the FQDN (ping hdss7-200.host.com) fails as well, the problem is the named service on 10.4.7.11.
 
4. If it is a problem with named on 10.4.7.11, restarting named does not help, and the FQDN (ping hdss7-200.host.com) still fails from the nodes even though 10.4.7.11 itself can ping every node, then it is an iptables problem on 10.4.7.11. Check whether the iptables service is installed and running (systemctl status iptables); if it is running: systemctl stop iptables; systemctl disable iptables.
 
5. dig -t A dashboard.od.com @192.168.0.2 +short now works, but dashboard.od.com is still unreachable: restart the CoreDNS pod.
 
6. Still broken: reboot 10.4.7.11, re-apply the iptables rules, and reboot the node:
[root@hdss7-21 ~]# iptables -t nat -D POSTROUTING -s 172.7.21.0/24 ! -o docker0 -j MASQUERADE
[root@hdss7-21 ~]# iptables -t nat -I POSTROUTING -s 172.7.21.0/24 ! -d 172.7.0.0/16 ! -o docker0 -j MASQUERADE
 
[root@hdss7-21 ~]# iptables -t filter -D INPUT -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-21 ~]# iptables -t filter -D FORWARD -j REJECT --reject-with icmp-host-prohibited
[root@hdss7-21 ~]# iptables-save >/etc/sysconfig/iptables
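
A minimal sketch of the CoreDNS restart referenced in item 2 above. Deleting the pod lets its Deployment recreate it; the k8s-app=coredns label is an assumption about how this CoreDNS manifest is labeled, so deleting by pod name is the safe fallback:

[root@hdss7-21 ~]# kubectl -n kube-system delete pod -l k8s-app=coredns
# or, without assuming labels:
[root@hdss7-21 ~]# kubectl -n kube-system get pod | grep coredns
[root@hdss7-21 ~]# kubectl -n kube-system delete pod <coredns-pod-name>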

6. If it still does not work, restart the CoreDNS pod once more. Now a new problem: the CoreDNS container will not come up.

[root@hdss7-21 ~]# kubectl get pod -n kube-system
NAME                                    READY   STATUS        RESTARTS   AGE
coredns-6d976bcb65-dzlct                1/1     Terminating   0          20m
coredns-6d976bcb65-mz9lp                0/1     Pending       0          63s
kubernetes-dashboard-7977cc79db-6p4g2   1/1     Running       3          2d15h
traefik-ingress-qp8k7                   1/1     Running       4          2d23h
traefik-ingress-tqtsr                   1/1     Running       4          2d23h

7. Force-delete the CoreDNS pod; the replacement stays Pending, and deleting it again gives the same result:

[root@hdss7-21 ~]# kubectl delete pod coredns-6d976bcb65-dzlct -n kube-system --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "coredns-6d976bcb65-dzlct" force deleted
 
[root@hdss7-21 ~]# kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
coredns-6d976bcb65-mz9lp                0/1     Pending   0          106s
kubernetes-dashboard-7977cc79db-6p4g2   1/1     Running   3          2d15h
traefik-ingress-qp8k7                   1/1     Running   4          2d23h
traefik-ingress-tqtsr                   1/1     Running   4          2d23h

8. Check the CoreDNS pod's logs; nothing useful there. (The HINFO query below is CoreDNS's own startup self-check, so the NXDOMAIN answer is harmless noise, not the error we are looking for.)

[root@hdss7-21 ~]# kubectl logs coredns-6d976bcb65-dzlct -n kube-system
.:53
2022-03-23T00:50:00.370Z [INFO] plugin/reload: Running configuration MD5 = a6f3121e89bcc0078758b78198d1ddde
2022-03-23T00:50:00.370Z [INFO] CoreDNS-1.6.1
2022-03-23T00:50:00.370Z [INFO] linux/amd64, go1.12.7, 1fda570
CoreDNS-1.6.1
linux/amd64, go1.12.7, 1fda570
2022-03-23T00:50:01.490Z [INFO] 127.0.0.1:53888 - 9565 "HINFO IN 3797648408549217670.517492447913026483. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.119263588s

9. Describe the CoreDNS pod for details: it says there are taints:

[root@hdss7-21 ~]# kubectl describe pod coredns-6d976bcb65-dzlct -n kube-system
  Warning  FailedScheduling  57s   default-scheduler  0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate

10. Check, and the node really does carry a taint:

[root@hdss7-21 ~]# kubectl describe node hdss7-21.host.com |grep -i taints
Taints:             node-role.kubernetes.io/master=master:NoSchedule

11. So try to delete the taint, but it cannot be found:

[root@hdss7-21 ~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master-
error: taint "node-role.kubernetes.io/master:" not found
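
For what it is worth, kubectl taint can also match the full key=value:effect triple exactly as describe printed it; if the bare-key form above errors out, the longer form is worth a try:

[root@hdss7-21 ~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master=master:NoSchedule-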

12. So look at the node status. Both nodes have gone NotReady. That explains the taint message: the nodes were never ready, so nothing could be scheduled onto them.

[root@hdss7-21 ~]# kubectl get nodes -o wide
NAME                STATUS     ROLES         AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
hdss7-21.host.com   NotReady   master,node   9d    v1.15.2   10.4.7.21     <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.13
hdss7-22.host.com   NotReady   master,node   9d    v1.15.2   10.4.7.22     <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.13

13. At this point, check whether the kubelet and kube-proxy components are running normally:

[root@hdss7-21 ~]# supervisorctl status
etcd-server-7-21                 RUNNING   pid 42597, uptime 0:32:19
flanneld-7-21                    RUNNING   pid 43151, uptime 0:30:58
kube-apiserver-7-21              RUNNING   pid 42595, uptime 0:32:19
kube-controller-manager-7-21     RUNNING   pid 42596, uptime 0:32:19
kube-kubelet-7-21                RUNNING   pid 44067, uptime 0:28:26
kube-proxy-7-21                  RUNNING   pid 44567, uptime 0:27:36
kube-scheduler-7-21              RUNNING   pid 42594, uptime 0:32:19
[root@hdss7-21 ~]# 

14. Check the cluster component status; etcd looks fine:

[root@hdss7-21 ~]# kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok                   
scheduler            Healthy   ok                   
etcd-1               Healthy   {"health": "true"}   
etcd-2               Healthy   {"health": "true"}   
etcd-0               Healthy   {"health": "true"}   
[root@hdss7-21 ~]# 
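
kubectl get cs already reports etcd as healthy, but when in doubt the members can be queried directly. The /opt/etcd path is an assumption about this deployment; adjust to wherever etcdctl lives, and add TLS flags if your etcd only listens over HTTPS:

[root@hdss7-21 ~]# /opt/etcd/etcdctl cluster-health
# or, if your etcdctl defaults to the v3 API:
[root@hdss7-21 ~]# ETCDCTL_API=3 /opt/etcd/etcdctl endpoint health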

15. Restarting the kubelet and kube-proxy components also goes fine; they come back RUNNING with no errors.
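
For reference, in this supervisor-managed deployment that restart is just (program names as printed by supervisorctl status above):

[root@hdss7-21 ~]# supervisorctl restart kube-kubelet-7-21 kube-proxy-7-21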

16. Restart the docker service.
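
That is simply the command below; note that restarting docker bounces every container on the node, and the kubelet then brings the pods back up:

[root@hdss7-21 ~]# systemctl restart docker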

17. That leaves the kubelet logs. A node's health is tied to the kubelet: it is the agent that runs on each machine, starts the pods scheduled onto that machine, and reports node status back. It does all of this by talking to the kube-apiserver, which in turn persists cluster state in etcd. etcd is healthy, so the kubelet is the next suspect.
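
A quick way to probe kubelet-to-apiserver connectivity from the node before wading through logs; 10.4.7.10:7443 is the apiserver endpoint this kubelet is configured with (it shows up in the log lines below), and on v1.15 /healthz is typically readable anonymously, so even a connection error is informative:

[root@hdss7-21 ~]# curl -k https://10.4.7.10:7443/healthz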

[root@hdss7-21 ~]# tail -100f /data/logs/kubernetes/kube-kubelet/kubelet.stdout.log 

E0323 10:21:30.007404   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.107861   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.209759   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.310686   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.411907   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.512347   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.537973   44072 kubelet_node_status.go:94] Unable to register node "hdss7-21.host.com" with API server: Post https://10.4.7.10:7443/api/v1/nodes: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538122   44072 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://10.4.7.10:7443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538239   44072 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://10.4.7.10:7443/api/v1/nodes?fieldSelector=metadata.name%3Dhdss7-21.host.com&limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538420   44072 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.4.7.10:7443/api/v1/pods?fieldSelector=spec.nodeName%3Dhdss7-21.host.com&limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538564   44072 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://10.4.7.10:7443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.538657   44072 controller.go:115] failed to ensure node lease exists, will retry in 7s, error: Get https://10.4.7.10:7443/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/hdss7-21.host.com?timeout=10s: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.539465   44072 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://10.4.7.10:7443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0323 10:21:30.631109   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.731590   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.832935   44072 kubelet.go:2248] node "hdss7-21.host.com" not found
E0323 10:21:30.934292   44072 kubelet.go:2248] node "hdss7-21.host.com" not found

The log complains that node "hdss7-21.host.com" is not found, yet ping hdss7-21.host.com works fine.

It also complains dial tcp 10.4.7.10:7443: connect: no route to host, and indeed ping 10.4.7.10 fails. Problem found: 10.4.7.10 is the keepalived VIP in front of the apiservers, so restart keepalived and it is solved.
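
On the machines that are supposed to carry 10.4.7.10 (assuming, as in this setup, it is a keepalived VIP fronting the apiservers), a sanity check around the restart looks like this:

systemctl status keepalived
ip addr | grep 10.4.7.10        # the VIP should be bound on exactly one proxy machine
systemctl restart keepalived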

Solution:

When a node goes NotReady: check whether the kubelet, kube-proxy, and etcd components are running; restart the kubelet and kube-proxy; if that does not help, restart the docker service; if that still does not help, check the health of the etcd cluster; and if etcd is healthy, read the kubelet logs.
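
Condensed into a quick triage sequence for a NotReady node, using the paths and addresses from this deployment:

[root@hdss7-21 ~]# supervisorctl status                              # kubelet / kube-proxy RUNNING?
[root@hdss7-21 ~]# kubectl get cs                                    # etcd / scheduler / controller-manager healthy?
[root@hdss7-21 ~]# iptables-save | grep reject                       # stray REJECT rules?
[root@hdss7-21 ~]# dig -t A dashboard.od.com @192.168.0.2 +short     # cluster DNS answering?
[root@hdss7-21 ~]# ping -c 3 10.4.7.10                               # apiserver VIP reachable?
[root@hdss7-21 ~]# tail -100 /data/logs/kubernetes/kube-kubelet/kubelet.stdout.log   # kubelet errors?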
