I. How the problem was discovered
An avatar training job was launched in production. There are two GPU nodes online, each with four P40 cards, so the platform should support two concurrent training jobs. However, the operations platform showed only one job actually training, while the other one sat in the queue indefinitely.
II. Impact of the problem
When a job does not start for a long time, the business cannot get its training results on time, and the idle GPU servers are wasted.
III. Detailed troubleshooting process
1. Log in to the master and check the pod status with kubectl. The pod created by one of the Jobs has been stuck in Pending.
[root@master-01 ~]# kubectl get pod -n ai-train -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
job-3343-train-run-001-61gbw-hzsgz 1/1 Running 0 4h18m 10.244.78.239 k8s-worker-04
job-3355-train-run-001-elz3o-8ql5s 0/1 Pending 0 137m
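As a side note, when there are many pods, a field selector can narrow the listing to just the Pending ones (a small convenience, not required for this case):
kubectl get pod -n ai-train --field-selector=status.phase=Pending -o wide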
2. Inspect the details of the pending pod with kubectl.
kubectl describe pod -n ai-train job-3355-train-run-001-elz3o-8ql5s
Check the warning events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 115m default-scheduler 0/4 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Insufficient nvidia.com/gpu.
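The key item is "2 Insufficient nvidia.com/gpu": neither GPU worker can satisfy the pod's GPU request. To see how many GPUs the pending pod actually asks for, one option (a quick jsonpath sketch; the exact output depends on the pod spec) is:
kubectl get pod -n ai-train job-3355-train-run-001-elz3o-8ql5s -o jsonpath='{.spec.containers[*].resources.limits}'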
IV. How the problem was resolved
1. Check the node's resource status
kubectl describe nodes
Addresses:
InternalIP: 11.167.230.17
Hostname: k8s-worker-05
Capacity:
cpu: 56
ephemeral-storage: 515927276Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 226811716Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 56
ephemeral-storage: 475478576775
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 226709316Ki
nvidia.com/gpu: 3
pods: 110
The node's capacity shows 4 nvidia.com/gpu, but only 3 are allocatable, i.e. one card is not available for scheduling.
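To compare GPU capacity and allocatable across all nodes at a glance, a custom-columns query like the following can be used (a sketch; the backslashes escape the dots in the extended resource name):
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU_CAP:.status.capacity.nvidia\.com/gpu,GPU_ALLOC:.status.allocatable.nvidia\.com/gpu'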
2. Check the GPU status on the node
[root@k8s-worker-05 ~]# nvidia-smi
Wed Apr 3 15:36:26 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02 Driver Version: 440.118.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:00:0B.0 Off | 0 |
| N/A 26C P8 10W / 250W | 0MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:00:0C.0 Off | 0 |
| N/A 25C P8 10W / 250W | 0MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:00:0D.0 Off | 0 |
| N/A 26C P8 10W / 250W | 0MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:00:0E.0 Off | 0 |
| N/A 28C P8 10W / 250W | 0MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Everything looks normal: all four cards are visible to the driver and no processes are using them.
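Since the driver sees all four cards, the missing allocatable GPU looks more like a reporting issue between the NVIDIA device plugin and the kubelet than a hardware fault. If the plugin is deployed as a DaemonSet (the upstream manifest uses the label name=nvidia-device-plugin-ds in kube-system; adjust to your own deployment), its pods and logs can be checked with:
kubectl get pod -n kube-system -l name=nvidia-device-plugin-ds -o wide
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=100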
3. Check whether any pods in the cluster are occupying GPU resources
[root@master-01 ~]# kubectl get pod -A -o wide |grep gpu
monitoring nvidia-gpu-exporter-c6gkt 1/1 Running 2 168d 10.244.117.139 k8s-worker-09 <none> <none>
monitoring nvidia-gpu-exporter-n625j 1/1 Running 0 190d 10.244.78.194 k8s-worker-04 <none> <none>
monitoring nvidia-gpu-exporter-z72pm 1/1 Running 1 178d 10.244.55.231 k8s-worker-05 <none> <none>
None are; only the monitoring exporter pods show up.
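Note that grepping for "gpu" only matches pods whose names contain that string. A stricter check is to list the pods that actually request the nvidia.com/gpu resource, for example with jq (a sketch, assuming jq is installed on the master):
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"] != null)) | .metadata.namespace + "/" + .metadata.name'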
4. Check whether any containers started directly through Docker are holding the GPUs
docker ps |grep gpu
9c23e030ed05 9f159534dd33 "/usr/bin/nvidia_gpu…" 5 months ago Up 5 months k8s_nvidia-gpu-exporter_nvidia-gpu-exporter-z72pm_monitoring_c178ff23-6bda-4b07-9b52-1d8525f71b80_1
cfd5cfe864a4 k8s.gcr.io/pause:3.4.1 "/pause" 5 months ago Up 5 months k8s_POD_nvidia-gpu-exporter-z72pm_monitoring_c178ff23-6bda-4b07-9b52-1d8525f71b80_63
Again, nothing.
5. The only thing left to try is rebooting the k8s-worker-05 node.
reboot
6. Problem solved
[root@master-01 ~]# kubectl get pod -n ai-train
NAME READY STATUS RESTARTS AGE
job-3343-train-run-001-61gbw-hzsgz 1/1 Running 0 4h37m
job-3355-train-run-001-elz3o-8ql5s 1/1 Running 0 156m
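As a final confirmation, the node's allocatable GPU count can be checked again and should now be back to 4 (same escaping trick as above):
kubectl get node k8s-worker-05 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'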
V. Summary and reflection
Sometimes, when you really cannot find the cause of a problem, rebooting the affected server is a legitimate last-resort fix.
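That said, before a full reboot it may be worth trying something less disruptive first. Since the nvidia.com/gpu allocatable value is reported by the device plugin through the kubelet, restarting those two components can be enough; the commands below are only a sketch (the label and namespace assume the upstream device plugin DaemonSet, and none of this was verified in this incident):
# On the affected node: restart the kubelet (assuming it is managed by systemd)
systemctl restart kubelet
# From the master: delete the device plugin pod on that node so the DaemonSet recreates it
kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds --field-selector spec.nodeName=k8s-worker-05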