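1. Problem description
No process appears to be running on the GPU, yet memory on GPU 2 stays allocated.
A zombie process is holding 6679 MiB of GPU memory.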
[root@node-01 ~]# nvidia-smi
Wed Apr 12 16:41:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 50C P0 26W / 70W | 3901MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:5F:00.0 Off | 0 |
| N/A 38C P8 14W / 70W | 4MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:86:00.0 Off | 0 |
| N/A 68C P0 45W / 70W | 6679MiB / 15360MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 41C P8 15W / 70W | 60MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
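To spot the affected card without reading the whole table, the same numbers can be requested as CSV. The sketch below is a minimal example: the helper name busy_gpus and the 1000 MiB threshold are hypothetical, and it assumes the input comes from nvidia-smi with the --query-gpu and --format=csv,noheader,nounits options.

```shell
# Print the index of every GPU whose used memory exceeds a threshold (MiB).
# Reads lines of the form "index, memory.used, utilization.gpu" on stdin.
# busy_gpus and its default threshold are illustrative, not part of any tool.
busy_gpus() {
    threshold_mib="${1:-1000}"
    awk -F', *' -v t="$threshold_mib" '$2 + 0 > t + 0 { print $1 }'
}

# Usage on a host with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
#       --format=csv,noheader,nounits | busy_gpus 1000
```

Fed the values from the table above, this would flag GPUs 0 and 2; GPU 2 is the one with no known workload.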
2. Solution
2.1 Install psmisc (provides fuser)
[root@node-01 ~]# yum install -y psmisc
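If the package may already be present, the install can be made conditional. A small sketch, where have_cmd is a hypothetical helper name:

```shell
# Return success if the named command is on PATH.
# have_cmd is an illustrative helper, not a standard utility.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Usage (yum-based host, as in the command above):
#   have_cmd fuser || yum install -y psmisc
```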
2.2 Find the zombie process
[root@node-01 ~]# fuser -v /dev/nvidia*
USER PID ACCESS COMMAND
/dev/nvidia0: spark 72859 F...m python
root 188098 F.... kubelet
root 214247 F...m python
root 217073 F.... nvidia-device-p
/dev/nvidia1: root 188098 F.... kubelet
root 214247 F...m python
root 217073 F.... nvidia-device-p
/dev/nvidia2: root 188098 F.... kubelet
root 214247 F...m python
root 217073 F.... nvidia-device-p
/dev/nvidia3: root 188098 F.... kubelet
root 214247 F...m python
root 217073 F.... nvidia-device-p
/dev/nvidiactl: spark 72859 F...m python
root 188098 F.... kubelet
root 214247 F...m python
root 217073 F.... nvidia-device-p
/dev/nvidia-uvm: spark 72859 F...m python
root 214247 F...m python
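GPU 2 (/dev/nvidia2):
the zombie process is PID 214247, the only holder of this device with the m (memory-mapped) access flag.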
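Picking the PID out of the table by eye can be automated. The sketch below filters a fuser -v table for one device and keeps only rows whose ACCESS column contains m, which in the output above separates the python processes from kubelet and nvidia-device-p. The function name mmap_holders is hypothetical, and it assumes the table layout shown above; redirecting with 2>&1 is also an assumption, based on fuser usually writing the verbose table to stderr.

```shell
# Print PIDs that hold the given device with the "m" (mmap) access flag.
# Reads a `fuser -v` table on stdin; mmap_holders is an illustrative name.
mmap_holders() {
    awk -v dev="$1:" '
        $1 == "USER" { next }                               # skip the header row
        $1 ~ /:$/  { current = $1; pid = $3; access = $4 }  # row that starts a new device
        $1 !~ /:$/ { pid = $2; access = $3 }                # continuation row for the same device
        current == dev && access ~ /m/ { print pid }
    '
}

# Usage:
#   fuser -v /dev/nvidia* 2>&1 | mmap_holders /dev/nvidia2
```

A PID reported here is only a candidate: confirm with ps that it is not a live workload before terminating it.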
2.3 Kill the zombie process
[root@node-01 ~]# kill -9 214247
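A gentler variant asks the process to exit first and forces it only if it does not. This is a sketch under assumptions: stop_pid and its 10-second default are hypothetical, and it relies on kill -0 to test whether the PID still exists, which also succeeds for a defunct process that its parent has not yet reaped.

```shell
# Send SIGTERM, wait up to a timeout (seconds), then fall back to SIGKILL.
# stop_pid is an illustrative helper; the default timeout is an assumption.
stop_pid() {
    pid="$1"
    timeout="${2:-10}"
    # If the signal cannot be sent, the PID is gone or we lack permission.
    kill -TERM "$pid" 2>/dev/null || return 1
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # process has exited
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null                # still present: force it
    return 0
}

# Usage:
#   stop_pid 214247 10
```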
2.4 Confirm the GPU memory is released
[root@node-01 ~]# nvidia-smi
Wed Apr 12 16:50:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 50C P0 26W / 70W | 3901MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:5F:00.0 Off | 0 |
| N/A 38C P8 15W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:86:00.0 Off | 0 |
| N/A 59C P8 17W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 41C P8 15W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
References:
http://www.taodudu.cc/news/show-4248859.html