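In short, kubelet checks the health of the containers on the current node every 10 seconds, and if a check has still not completed after 3 minutes, it reports PLEG is not healthy.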
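The check described above can be sketched as a small toy model. This is only an illustration and not kubelet's actual code (kubelet is written in Go): the class name `PlegSketch` and the injectable `clock` parameter are made up for this example, and the 3-minute threshold is taken from the "threshold is 3m0s" text in the kubelet log.

```python
import time

# Threshold taken from the kubelet log message ("threshold is 3m0s").
RELIST_THRESHOLD_SECONDS = 3 * 60


class PlegSketch:
    """Hypothetical toy model: remembers when the last relist was recorded."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the behaviour can be tried without waiting.
        self._clock = clock
        self._last_relist = clock()

    def relist(self):
        # A real relist would ask the container runtime for the state of
        # every pod and container; this sketch only records the time.
        self._last_relist = self._clock()

    def healthy(self):
        # Unhealthy once the last recorded relist is older than the threshold.
        elapsed = self._clock() - self._last_relist
        if elapsed > RELIST_THRESHOLD_SECONDS:
            return False, (
                f"pleg was last seen active {elapsed:.0f}s ago; "
                f"threshold is {RELIST_THRESHOLD_SECONDS}s"
            )
        return True, ""
```

If the runtime call inside the relist step hangs, `relist()` never finishes, the recorded time stops advancing, and `healthy()` starts returning `False` once the threshold is exceeded.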
The most visible symptom of PLEG is not healthy is shown in the figure below:
Once PLEG is not healthy appears, the node can no longer schedule and run Pods normally.
Analyzing PLEG is not healthy
When PLEG is not healthy occurs, the kubelet log should contain error messages similar to the following:
E0708 16:45:53.999538 2103 pod_workers.go:190] Error syncing pod d9cbbfb8-4ae6-4758-b3d4-fc106db98316 ("wordpress-6bcf994cbd-8fbzd_default(d9cbbfb8-4ae6-4758-b3d4-fc106db98316)"), skipping: failed to "StartContainer" for "wordpress" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=wordpress pod=wordpress-6bcf994cbd-8fbzd_default(d9cbbfb8-4ae6-4758-b3d4-fc106db98316)"
I0708 16:45:35.881181 2103 kubelet.go:2101] Failed to delete pod "rancher-logging-fluentd-linux-qw6cv_cattle-logging(8536a1ac-d253-485c-9f29-fecd7bd4a42d)", err: pod not found
I0708 16:45:35.881181 2103 kubelet.go:2101] Failed killing the pod "rancher-logging-fluentd-linux-qw6cv_cattle-logging(8536a1ac-d253-485c-9f29-fecd7bd4a42d)", err: pod not found
E0729 11:21:02.178079 9575 remote_runtime.go:321] ContainerStatus "052b011f5fcc54536223afb2d544344c58e62b96bc836a64140c5e36af004ceb" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0729 11:21:02.178130 9575 kuberuntime_manager.go:917] getPodContainerStatuses for pod "bushuzu-nettools-88854d496-9nvlk_p000(472cb4d7-caf0-11ea-ab9d-ac1f6ba4811c)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0729 11:21:50.993186 9575 kubelet_pods.go:1093] Failed killing the pod "bushuzu-nettools-88854d496-9nvlk": failed to "KillContainer" for "bushuzu-nettools" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
I0729 11:20:59.245243 9575 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.138893648s ago; threshold is 3m0s.
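To see how far past the threshold a node is, the last log line above can be parsed mechanically. The following is a minimal sketch, assuming the message keeps the wording shown in that line; the helper names `parse_go_duration` and `pleg_lag` are made up for this example, and the parser only covers plain Go-style duration strings such as `3m57.138893648s`.

```python
import re

# Seconds per unit for Go-style duration strings.
UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}
# Multi-letter units are listed before "s" and "m" so that "ms" is not read as "m".
TOKEN_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
PLEG_RE = re.compile(r"pleg was last seen active (\S+) ago; threshold is (\S+)")


def parse_go_duration(text):
    """Convert a duration such as '3m57.138893648s' to seconds."""
    pos, total = 0, 0.0
    for match in TOKEN_RE.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected text in duration: {text!r}")
        total += float(match.group(1)) * UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"not a duration: {text!r}")
    return total


def pleg_lag(line):
    """Return (elapsed_seconds, threshold_seconds), or None if the line does not match."""
    match = PLEG_RE.search(line)
    if match is None:
        return None
    elapsed = parse_go_duration(match.group(1))
    # The log line ends with a full stop directly after the threshold value.
    threshold = parse_go_duration(match.group(2).rstrip("."))
    return elapsed, threshold
```

Applied to the line above, `pleg_lag` reports an elapsed time of roughly 237 seconds against a threshold of 180 seconds, that is, the node is about a minute past the limit.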
Resolution
Some Kubernetes dependency, or Docker itself, may be preventing a Pod or container from ever being deleted.
Force-delete the Pod
kubectl -n <namespace> delete pod --force --grace-period=0 <pod-name>
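To find candidates for the force-delete command, one option is to list the Pods that already carry a `deletionTimestamp`, which is how a Pod in the Terminating state appears in the API. The sketch below is an assumption-laden helper rather than an established tool: the function names are made up for this example, it relies on the JSON output of `kubectl get pods --all-namespaces -o json`, and it only prints the commands instead of running them. A Pod that is still within its normal grace period also matches, so the output should be reviewed before anything is deleted.

```python
import json
import subprocess


def stuck_pods(pod_list):
    """Return (namespace, name) for every pod that has a deletionTimestamp."""
    result = []
    for item in pod_list.get("items", []):
        meta = item.get("metadata", {})
        if meta.get("deletionTimestamp"):
            result.append((meta.get("namespace", "default"), meta["name"]))
    return result


def force_delete_commands(pods):
    """Build the force-delete command line for each (namespace, name) pair."""
    return [
        f"kubectl -n {ns} delete pod --force --grace-period=0 {name}"
        for ns, name in pods
    ]


def main():
    # Requires kubectl on PATH with access to the cluster.
    raw = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for command in force_delete_commands(stuck_pods(json.loads(raw))):
        print(command)
```

Calling `main()` on a machine with a working kubectl configuration prints one command per Pod that is being deleted; each command can then be run by hand.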