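Problem
After GPU virtualization was enabled in production, it consumed local memory and caused nodes in the production cluster to restart frequently. After the production nodes were upgraded to a larger spec, one node still would not come up. Its node events were as follows: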
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 10m kube-proxy
Normal Starting 45m kubelet Starting kubelet.
Normal Starting 45m kubelet Starting kubelet.
Normal Starting 45m kubelet Starting kubelet.
Normal Starting 45m kubelet Starting kubelet.
Normal Starting 45m kubelet Starting kubelet.
Normal Starting 44m kubelet Starting kubelet.
Normal Starting 44m kubelet Starting kubelet.
Normal Starting 44m kubelet Starting kubelet.
Normal Starting 44m kubelet Starting kubelet.
Normal Starting 44m kubelet Starting kubelet.
Normal Starting 43m kubelet Starting kubelet.
Normal Starting 43m kubelet Starting kubelet.
Normal Starting 43m kubelet Starting kubelet.
Normal Starting 43m kubelet Starting kubelet.
Normal Starting 43m kubelet Starting kubelet.
Normal Starting 43m kubelet Starting kubelet.
Normal Starting 42m kubelet Starting kubelet.
Normal Starting 42m kubelet Starting kubelet.
Normal Starting 42m kubelet Starting kubelet.
Normal Starting 42m kubelet Starting kubelet.
Normal Starting 42m kubelet Starting kubelet.
Normal Starting 42m kubelet Starting kubelet.
Solution
Check the kubelet logs:
journalctl -u kubelet
To preserve the evidence, the production logs can be exported to a file like this:
journalctl --since="2023-7-22 15:00:00" -u kubelet > kubelet.log
The logs contain the following error:
Jul 22 19:16:34 gpu-node1 kubelet[19095]: E0722 19:16:34.185254 19095 kubelet.go:1437] "Failed to start ContainerManager" err="failed to build map of initial containers from runtime: no PodsandBox found with Id '787127a3facc5f73cdbd791ecd8377a767f79449d94d8fe4dc9ebf999d16b7bb'"
Jul 22 19:16:34 gpu-node1 systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Jul 22 19:16:34 gpu-node1 systemd[1]: Unit kubelet.service entered failed state.
Jul 22 19:16:34 gpu-node1 systemd[1]: kubelet.service failed.
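When kubelet is crash-looping, the same error repeats many times in the exported log, so it can help to reduce it to the distinct sandbox IDs involved. The helper below is a minimal sketch, not part of the original procedure: the function name `missing_sandbox_ids` is hypothetical, and it assumes the error text has exactly the form shown in the log above.

```shell
# Print the unique sandbox IDs named in "no PodsandBox found" errors.
# Reads the log files given as arguments, or stdin when none are given.
# (missing_sandbox_ids is a hypothetical helper name.)
missing_sandbox_ids() {
  # -h suppresses file-name prefixes; -o prints only the matching part.
  grep -h -o "no PodsandBox found with Id '[0-9a-f]*'" "$@" \
    | sed "s/.*'\([0-9a-f]*\)'/\1/" \
    | sort -u
}
```

For example, `missing_sandbox_ids kubelet.log` lists the IDs from the file exported earlier. Each ID can then be looked up with `docker ps -a --no-trunc` to see whether a matching container still exists on the node.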
Fix:
Deleting the containers that are not running resolves the problem:
# Remove containers whose status is Exited
docker rm $(docker ps -qf status=exited)
or
# Docker 1.13 and later
docker container prune
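Note that `docker rm $(docker ps -qf status=exited)` fails with an argument error when no exited containers exist, because `docker rm` is then called with no arguments. The wrapper below is a small sketch that guards against this case. The function name `remove_exited_containers` is hypothetical, and the `-a` flag is added only to make the intent explicit.

```shell
# Remove containers in the Exited state; do nothing when there are none.
# (remove_exited_containers is a hypothetical helper name.)
remove_exited_containers() {
  # Collect the IDs of all exited containers, one per line.
  ids=$(docker ps -aq --filter status=exited) || return 1
  if [ -z "$ids" ]; then
    echo "no exited containers"
    return 0
  fi
  # Word splitting on $ids is intended: one argument per container ID.
  docker rm $ids
}
```

After running `remove_exited_containers` on the affected node, `systemctl status kubelet` shows whether kubelet now stays up. This assumes kubelet runs as the systemd unit `kubelet.service`, as the log above suggests.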
This is a kubelet bug: kubelet panics when a pause container is lost while kubelet is restarting.