How to Fix a Kubernetes Job That Stays Pending

I. How the problem was discovered

An avatar-training job was launched in production. The environment has two GPU nodes, each with four P40 cards, so it should support two concurrent training jobs. The operations platform, however, showed only one job actually training while the other sat in the queue.

II. Impact of the problem

A job that never starts delays the business in getting its training results on time, and it also leaves server resources sitting idle.

III. Detailed troubleshooting

1. Logged in to the server and checked pod status with kubectl: the Pod created by one of the Jobs had been stuck in Pending.

[root@master-01 ~]# kubectl get pod -n ai-train -o wide
NAME                                 READY   STATUS    RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
job-3343-train-run-001-61gbw-hzsgz   1/1     Running   0          4h18m   10.244.78.239   k8s-worker-04
job-3355-train-run-001-elz3o-8ql5s   0/1     Pending   0          137m
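When the namespace holds many pods, a field selector narrows the list to just the stuck ones; a minimal sketch:

kubectl get pod -n ai-train --field-selector=status.phase=Pending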

2. Inspected the pod details with kubectl describe.

kubectl describe pod -n ai-train job-3355-train-run-001-elz3o-8ql5s 

The Events section shows the scheduling warning:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ---   ----               -------
  Warning  FailedScheduling  115m  default-scheduler  0/4 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Insufficient nvidia.com/gpu.
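The decisive clue is "2 Insufficient nvidia.com/gpu": both GPU workers look short of free GPUs to the scheduler even though only one training job is running. A quick way to compare each node's GPU capacity with what it currently advertises as allocatable is a custom-columns query; this is just a sketch, and the dots in the extended resource name have to be escaped:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU_CAP:.status.capacity.nvidia\.com/gpu,GPU_ALLOC:.status.allocatable.nvidia\.com/gpu'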

IV. How the problem was fixed

1. Check the node's resources

kubectl describe nodes 

Addresses:
  InternalIP:  11.167.230.17
  Hostname:    k8s-worker-05
Capacity:
  cpu:                56
  ephemeral-storage:  515927276Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             226811716Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                56
  ephemeral-storage:  475478576775
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             226709316Ki
  nvidia.com/gpu:     3
  pods:               110

Capacity lists 4 nvidia.com/gpu but Allocatable only 3, so one card is not available to the scheduler.
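The allocatable count for an extended resource such as nvidia.com/gpu is reported by the NVIDIA device plugin on the node, so its pod and logs are worth a look. A sketch, assuming the plugin runs as a DaemonSet in kube-system with the commonly used label name=nvidia-device-plugin-ds (names vary by cluster), and <device-plugin-pod> standing in for whatever pod the first command shows on k8s-worker-05:

kubectl get pod -n kube-system -o wide -l name=nvidia-device-plugin-ds | grep k8s-worker-05
kubectl logs -n kube-system <device-plugin-pod>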

2. Check the state of the GPUs

[root@k8s-worker-05 ~]# nvidia-smi
Wed Apr  3 15:36:26 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   26C    P8    10W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   25C    P8    10W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   26C    P8    10W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:00:0E.0 Off |                    0 |
| N/A   28C    P8    10W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Everything looks normal: all four cards are visible to the driver, idle, and running no processes.
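So the driver sees four healthy cards while kubelet advertises only three as allocatable, which points at the path between the device plugin and kubelet. One way to look for registration or health errors, assuming kubelet runs under systemd (a sketch, run on k8s-worker-05):

journalctl -u kubelet --since "3 hours ago" | grep -iE "nvidia|device.plugin"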

3. Check whether some pod is already holding GPU resources

[root@master-01 ~]# kubectl get pod -A -o wide |grep gpu
monitoring         nvidia-gpu-exporter-c6gkt                          1/1     Running             2          168d    10.244.117.139   k8s-worker-09   <none>           <none>
monitoring         nvidia-gpu-exporter-n625j                          1/1     Running             0          190d    10.244.78.194    k8s-worker-04   <none>           <none>
monitoring         nvidia-gpu-exporter-z72pm                          1/1     Running             1          178d    10.244.55.231    k8s-worker-05   <none>           <none>

Nothing relevant: only the nvidia-gpu-exporter monitoring pods match.
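Grepping pod names for "gpu" can miss a pod that requests the resource without having it in its name. A more precise check is the node's "Allocated resources" section, which sums the requests the scheduler has already accounted for; a sketch:

kubectl describe node k8s-worker-05 | grep -A 10 "Allocated resources"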

4. Check whether a container started directly through Docker is holding a GPU

docker ps |grep gpu
9c23e030ed05   9f159534dd33                                                                      "/usr/bin/nvidia_gpu…"   5 months ago   Up 5 months             k8s_nvidia-gpu-exporter_nvidia-gpu-exporter-z72pm_monitoring_c178ff23-6bda-4b07-9b52-1d8525f71b80_1
cfd5cfe864a4   k8s.gcr.io/pause:3.4.1                                                            "/pause"                 5 months ago   Up 5 months             k8s_POD_nvidia-gpu-exporter-z72pm_monitoring_c178ff23-6bda-4b07-9b52-1d8525f71b80_63

Nothing there either; only the exporter's containers show up.
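That leaves kubelet and the device plugin themselves as suspects. Kubelet learns about extended resources through a Unix socket the device plugin registers under /var/lib/kubelet/device-plugins/, so confirming the socket files exist and when they were last touched can help (a sketch; the socket file name differs between plugin versions):

ls -l /var/lib/kubelet/device-plugins/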

5. With nothing else found, the last resort was to reboot the k8s-worker-05 node.

reboot
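If the node were running other workloads, it would be safer to cordon and drain it first and uncordon it afterwards. A sketch; on older kubectl versions the flag is --delete-local-data instead of --delete-emptydir-data:

kubectl drain k8s-worker-05 --ignore-daemonsets --delete-emptydir-data    # on master-01
reboot                                                                    # on k8s-worker-05
kubectl uncordon k8s-worker-05                                            # on master-01, once the node is back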

6. Problem solved

[root@master-01 ~]# kubectl get pod -n ai-train
NAME                                 READY   STATUS    RESTARTS   AGE
job-3343-train-run-001-61gbw-hzsgz   1/1     Running   0          4h37m
job-3355-train-run-001-elz3o-8ql5s   1/1     Running   0          156m

V. Summary and reflections

Sometimes, when you really cannot pin down the root cause, rebooting the affected server is an acceptable last-resort fix.
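In hindsight, a less disruptive option worth trying before a full reboot is to restart only the components that report the GPU count: the NVIDIA device plugin pod and kubelet. Whether this clears a stale allocatable value depends on the actual root cause; the pod name below is a placeholder for whatever device-plugin pod runs on the node:

kubectl delete pod -n kube-system <device-plugin-pod>    # the DaemonSet recreates it, and it re-registers its GPUs
systemctl restart kubelet                                 # on k8s-worker-05, so kubelet re-reads device plugin registrations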
