Problem
A node in the Kubernetes cluster is in the NotReady state.
First, check whether swap is disabled on the node:
free -h
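If swap is still on, disable it and comment out the swap entry in /etc/fstab so that it stays off after a reboot: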
swapoff -a
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
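The swap check and the fstab edit above can also be scripted. The following is a minimal sketch, not part of the original procedure: the helper names swap_is_active and comment_out_swap are hypothetical, and both accept an optional file path so they can be tried on copies first. It assumes GNU sed (as shipped with Ubuntu) and matches "swap" in the third fstab field, so tab-separated entries are handled and lines that are already commented are left alone.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the file-path parameters are hypothetical,
# added so the logic can be tried on copies instead of the real system files.

# Return 0 if the given swaps table (default /proc/swaps) lists an active swap area.
swap_is_active() {
    local swaps_file="${1:-/proc/swaps}"
    # Line 1 is the header; any further non-empty line is an active swap area.
    awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }' "$swaps_file"
}

# Comment out every uncommented fstab entry whose type (third field) is "swap".
comment_out_swap() {
    local fstab="${1:-/etc/fstab}"
    sed -i -E \
        '/^[[:space:]]*#/!s/^([[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+swap([[:space:]].*)?)$/#\1/' \
        "$fstab"
}
```

On the node, as root, this could be used as: swap_is_active && swapoff -a, followed by comment_out_swap.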
From the master node, describe the faulty node to check its status:
[root@wl-master /home/ubuntu]# kubectl describe node worker02-wl-2
Name: worker02-wl-2
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=worker02-wl-2
kubernetes.io/os=linux
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"c6:4d:65:5c:87:3a"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.10.10.209
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 25 Oct 2022 06:22:18 +0000
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: worker02-wl-2
AcquireTime: <unset>
RenewTime: Thu, 03 Nov 2022 02:38:51 +0000
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Tue, 25 Oct 2022 06:24:16 +0000   Tue, 25 Oct 2022 06:24:16 +0000   FlannelIsUp         Flannel is running on this node
  MemoryPressure       Unknown   Thu, 03 Nov 2022 02:38:36 +0000   Thu, 03 Nov 2022 02:39:32 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Thu, 03 Nov 2022 02:38:36 +0000   Thu, 03 Nov 2022 02:39:32 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Thu, 03 Nov 2022 02:38:36 +0000   Thu, 03 Nov 2022 02:39:32 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Thu, 03 Nov 2022 02:38:36 +0000   Thu, 03 Nov 2022 02:39:32 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
InternalIP: 10.10.10.209
Hostname: worker02-wl-2
Capacity:
cpu: 4
ephemeral-storage: 40458684Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 4038644Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 37286723113
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3936244Ki
pods: 110
System Info:
Machine ID: cd1b7061a9d545dd8219916c9737143b
System UUID: CD1B7061-A9D5-45DD-8219-916C9737143B
Boot ID: 114cf509-6163-490d-9240-3e7246d0d8b7
Kernel Version: 4.15.0-194-generic
OS Image: Ubuntu 18.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.19.2
Kube-Proxy Version: v1.19.2
PodCIDR: 10.244.3.0/24
PodCIDRs: 10.244.3.0/24
Non-terminated Pods: (4 in total)
  Namespace      Name                          CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------      ----                          ------------   ----------   ---------------   -------------   ---
  istio-system   prometheus-69f7f4d689-cllns   0 (0%)         0 (0%)       0 (0%)            0 (0%)          3h45m
  kube-flannel   kube-flannel-ds-lwztz         100m (2%)      100m (2%)    50Mi (1%)         50Mi (1%)       9d
  kube-system    kube-flannel-ds-p2w5q         100m (2%)      100m (2%)    50Mi (1%)         50Mi (1%)       9d
  kube-system    kube-proxy-49mvl              0 (0%)         0 (0%)       0 (0%)            0 (0%)          9d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 200m (5%) 200m (5%)
memory 100Mi (2%) 100Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
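In the Conditions section, every condition reported by the kubelet (MemoryPressure, DiskPressure, PIDPressure and Ready) has the status Unknown with the message "Kubelet stopped posting node status.", which means the kubelet on this node is no longer reporting its status to the control plane. The node also carries the node.kubernetes.io/unreachable taints. For comparison, a healthy node looks like this: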
[root@wl-master /home/ubuntu]# kubectl describe node worker02-wl-1
Name: worker02-wl-1
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=worker02-wl-1
kubernetes.io/os=linux
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"c6:1c:70:48:b0:cc"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.10.10.229
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 19 Oct 2022 14:09:42 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: worker02-wl-1
AcquireTime: <unset>
RenewTime: Thu, 03 Nov 2022 06:25:15 +0000
Conditions:
  Type                 Status   LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------   -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False    Mon, 24 Oct 2022 06:17:58 +0000   Mon, 24 Oct 2022 06:17:58 +0000   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False    Thu, 03 Nov 2022 06:21:06 +0000   Wed, 19 Oct 2022 14:09:42 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False    Thu, 03 Nov 2022 06:21:06 +0000   Wed, 19 Oct 2022 14:09:42 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False    Thu, 03 Nov 2022 06:21:06 +0000   Wed, 19 Oct 2022 14:09:42 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True     Thu, 03 Nov 2022 06:21:06 +0000   Wed, 19 Oct 2022 14:09:43 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.10.10.229
Hostname: worker02-wl-1
Capacity:
cpu: 4
ephemeral-storage: 40470732Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 4038632Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 37297826550
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3936232Ki
pods: 110
System Info:
Machine ID: 0d5a0ccf23de42b899ac201e08ceb571
System UUID: 0D5A0CCF-23DE-42B8-99AC-201E08CEB571
Boot ID: fb332d05-2f0d-4668-967c-27ae026644fe
Kernel Version: 4.15.0-169-generic
OS Image: Ubuntu 18.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.19.2
Kube-Proxy Version: v1.19.2
PodCIDR: 10.244.2.0/24
PodCIDRs: 10.244.2.0/24
Non-terminated Pods: (14 in total)
  Namespace        Name                                   CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------        ----                                   ------------   ----------   ---------------   -------------   ---
  default          details-v1-79f774bdb9-m498h            10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  default          productpage-v1-6b746f74dc-q5p5q        10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  default          ratings-v1-b6994bb9-zcqlt              10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  default          reviews-v1-545db77b95-mvncw            10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  default          reviews-v2-7bf8c9648f-g69rm            10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  default          reviews-v3-84779c7bbc-8fq88            10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  istio-operator   istio-operator-99f9c574d-cjt2m         50m (1%)       200m (5%)    128Mi (3%)        256Mi (6%)      9d
  istio-system     istio-egressgateway-757584858-kgfzq    10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       10d
  istio-system     istio-ingressgateway-cf4f68b6d-h4p2q   10m (0%)       2 (50%)      40Mi (1%)         1Gi (26%)       9d
  istio-system     istiod-6c5f5698d-4pn4f                 10m (0%)       0 (0%)       100Mi (2%)        0 (0%)          10d
  kube-flannel     kube-flannel-ds-6bkj5                  100m (2%)      100m (2%)    50Mi (1%)         50Mi (1%)       14d
  kube-system      coredns-6c76c8bb89-8hmxk               100m (2%)      0 (0%)       70Mi (1%)         170Mi (4%)      9d
  kube-system      kube-flannel-ds-lwmn4                  100m (2%)      100m (2%)    50Mi (1%)         50Mi (1%)       10d
  kube-system      kube-proxy-r46s6                       0 (0%)         0 (0%)       0 (0%)            0 (0%)          14d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 440m (11%) 16400m (409%)
memory 718Mi (18%) 8718Mi (226%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
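The two describe outputs differ mainly in the Ready condition (Unknown on worker02-wl-2, True on worker02-wl-1). As a rough sketch, that single value can be read directly with kubectl's JSONPath output instead of scanning the full description. The function names and the KUBECTL override variable below are hypothetical conveniences, not part of the original steps.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the KUBECTL override are hypothetical.
KUBECTL="${KUBECTL:-kubectl}"

# Print the status of the Ready condition (True, False or Unknown) for a node.
node_ready_status() {
    local node="$1"
    "$KUBECTL" get node "$node" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Succeed only when the node reports Ready=True.
node_is_ready() {
    [ "$(node_ready_status "$1")" = "True" ]
}
```

For example, node_is_ready worker02-wl-2 || echo "worker02-wl-2 is not Ready" could be run from the master.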
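Next, check whether the kubelet is running properly. On the worker node, run:
systemctl status kubelet
If the status is failed, a restart is needed. To restart Docker:
sudo systemctl restart docker
To restart the kubelet:
sudo systemctl restart kubelet
If the kubelet is running normally, continue with the steps below.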
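The kubelet check and the restarts can be combined into one step. This is a minimal sketch under some assumptions: it runs as root on the worker node, it restarts Docker before the kubelet, and the function name and the SYSTEMCTL override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run on the worker node as root. The function name and the
# SYSTEMCTL override are hypothetical; the steps above call systemctl directly.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Restart docker and then the kubelet, but only if the kubelet is not active.
restart_kubelet_if_down() {
    if "$SYSTEMCTL" is-active --quiet kubelet; then
        echo "kubelet is active; continue with the pod checks"
        return 0
    fi
    echo "kubelet is not active; restarting docker and kubelet"
    "$SYSTEMCTL" restart docker && "$SYSTEMCTL" restart kubelet
}
```

Calling restart_kubelet_if_down leaves a healthy kubelet untouched, so it is safe to run before moving on to the pod checks.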
Check the status of the pods on the NotReady node:
kubectl get pod --all-namespaces -o wide | grep worker02-wl-2
The output shows that the flannel network pod on this node is failing.
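The grep above matches the node name anywhere in a row. As an alternative sketch, the pods can be filtered on the API side with a field selector on spec.nodeName and then narrowed to the ones that are not Running or Completed. The function names and the KUBECTL override are hypothetical, and the awk filter assumes the STATUS value is the fourth column, as it is with --all-namespaces output.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the KUBECTL override are hypothetical.
KUBECTL="${KUBECTL:-kubectl}"

# List all pods scheduled on a node, filtered by spec.nodeName.
pods_on_node() {
    local node="$1"
    "$KUBECTL" get pods --all-namespaces -o wide \
        --field-selector "spec.nodeName=${node}"
}

# Keep the header plus rows whose STATUS (4th column) is not Running or Completed.
unhealthy_pods_on_node() {
    pods_on_node "$1" | awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}
```

For example: unhealthy_pods_on_node worker02-wl-2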
Describe the failing pod:
[root@wl-master /home/ubuntu]# kubectl describe pod kube-flannel-ds-lwztz -n kube-flannel
Name: kube-flannel-ds-lwztz
Namespace: kube-flannel
Priority: 2000001000
Priority Class Name: system-node-critical
Node: worker02-wl-2/10.10.10.209
Start Time: Tue, 25 Oct 2022 06:22:19 +0000
Labels: app=flannel
controller-revision-hash=745f596757
pod-template-generation=1
tier=node
Annotations: <none>
Status: Running
IP: 10.10.10.209
IPs:
IP: 10.10.10.209
Controlled By: DaemonSet/kube-flannel-ds
Init Containers:
install-cni-plugin:
Container ID: docker://060253344ea958620357439a1c1fa1c021f27f47ef832b9c68d298be93320e42
Image: docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.0
Image ID: docker-pullable://rancher/mirrored-flannelcni-flannel-cni-plugin@sha256:28d3a6be9f450282bf42e4dad143d41da23e3d91f66f19c01ee7fd21fd17cb2b
Port: <none>
Host Port: <none>
Command:
cp
Args:
-f
/flannel
/opt/cni/bin/flannel
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 25 Oct 2022 06:24:35 +0000
Finished: Tue, 25 Oct 2022 06:24:35 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/opt/cni/bin from cni-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from flannel-token-2bnmk (ro)
install-cni:
Container ID: docker://d23820d6f05f19411581a837a74a0b2634923c200704e74541fdd7cf7ca04586
Image: docker.io/rancher/mirrored-flannelcni-flannel:v0.20.0
Image ID: docker-pullable://rancher/mirrored-flannelcni-flannel@sha256:24e693e10c53c9d5dd78196f77cd5328d6b9d90aff203a37de07d3b040dc938d
Port: <none>
Host Port: <none>
Command:
cp
Args:
-f
/etc/kube-flannel/cni-conf.json
/etc/cni/net.d/10-flannel.conflist
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 25 Oct 2022 06:25:13 +0000
Finished: Tue, 25 Oct 2022 06:25:13 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/cni/net.d from cni (rw)
/etc/kube-flannel/ from flannel-cfg (rw)
/var/run/secrets/kubernetes.io/serviceaccount from flannel-token-2bnmk (ro)
Containers:
kube-flannel:
Container ID: docker://cc05c1d6adc620ed6c05697380977acc14cff86b885b22bde51d928ba69b4e0c
Image: docker.io/rancher/mirrored-flannelcni-flannel:v0.20.0
Image ID: docker-pullable://rancher/mirrored-flannelcni-flannel@sha256:24e693e10c53c9d5dd78196f77cd5328d6b9d90aff203a37de07d3b040dc938d
Port: <none>
Host Port: <none>
Command:
/opt/bin/flanneld
Args:
--ip-masq
--kube-subnet-mgr
State: Waiting
Reason: CreateContainerConfigError
Last State: Terminated
Reason: ContainerCannotRun
Message: input/output error
Exit Code: 128
Started: Wed, 26 Oct 2022 12:18:23 +0000
Finished: Wed, 26 Oct 2022 12:18:23 +0000
Ready: False
Restart Count: 353
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 100m
memory: 50Mi
Environment:
POD_NAME: kube-flannel-ds-lwztz (v1:metadata.name)
POD_NAMESPACE: kube-flannel (v1:metadata.namespace)
EVENT_QUEUE_DEPTH: 5000
Mounts:
/etc/kube-flannel/ from flannel-cfg (rw)
/run/flannel from run (rw)
/run/xtables.lock from xtables-lock (rw)
/var/run/secrets/kubernetes.io/serviceaccount from flannel-token-2bnmk (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run:
Type: HostPath (bare host directory volume)
Path: /run/flannel
HostPathType:
cni-plugin:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
flannel-cfg:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-flannel-cfg
Optional: false
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
flannel-token-2bnmk:
Type: Secret (a volume populated by a Secret)
SecretName: flannel-token-2bnmk
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoScheduleop=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events: <none>
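In the output above, the fields that matter are under the kube-flannel container: State is Waiting with reason CreateContainerConfigError, the last termination reason is ContainerCannotRun with the message "input/output error", and the restart count is 353. As a sketch, the same fields can be pulled out in one line per container with a JSONPath template. The function name and the KUBECTL override are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the KUBECTL override are hypothetical.
KUBECTL="${KUBECTL:-kubectl}"

# Print one tab-separated line per container:
# name, restart count, waiting reason, last termination reason, last termination message.
container_states() {
    local pod="$1" namespace="$2"
    "$KUBECTL" get pod "$pod" -n "$namespace" \
        -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.message}{"\n"}{end}'
}
```

For example: container_states kube-flannel-ds-lwztz kube-flannel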
If none of the above helps, as a last resort reset the node and join it to the cluster again.
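The original notes do not list the rejoin commands. The sketch below only prints a typical sequence so it can be reviewed before anything is run. It assumes the cluster was built with kubeadm (the kubeadm.alpha.kubernetes.io/cri-socket annotation on the nodes suggests so) and kubectl v1.19, where the drain flag is still called --delete-local-data. The function name is hypothetical, and drain may time out while the kubelet is unreachable.

```shell
#!/usr/bin/env bash
# Sketch only: prints a plan, executes nothing. The function name is hypothetical
# and the sequence assumes a kubeadm-managed cluster.
print_rejoin_plan() {
    local node="$1"
    cat <<EOF
# On the master: evict workloads and remove the node object
kubectl drain ${node} --ignore-daemonsets --delete-local-data --force --timeout=120s
kubectl delete node ${node}
# On ${node}: wipe the state created by kubeadm
sudo kubeadm reset
# On the master: print a fresh join command
kubeadm token create --print-join-command
# On ${node}: run the 'kubeadm join ...' line printed by the previous command
EOF
}
```

For example: print_rejoin_plan worker02-wl-2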