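Clearing unexplained GPU memory usage
Problem description
Recently I was running an experiment on a server over a VS Code remote connection when the network dropped and the experiment was interrupted. When I reran the experiment, it kept failing with a CUDA out of memory error. Checking with nvidia-smi showed 18571MiB in use on GPU 3, but no PID was listed as the process using that memory: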
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0  On |                  N/A |
| 30%   35C    P8    18W / 350W |    185MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 39%   34C    P8    24W / 350W |     10MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:08:00.0 Off |                  N/A |
| 54%   33C    P8    22W / 350W |     10MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
| 30%   36C    P8    16W / 350W |  18571MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1632      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      5214      G   /usr/lib/xorg/Xorg                 90MiB |
|    0   N/A  N/A      5374      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A      5447      G   ...mviewer/tv_bin/TeamViewer        6MiB |
|    0   N/A  N/A   1816342      G   /usr/lib/firefox/firefox           11MiB |
|    1   N/A  N/A      1632      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      5214      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1632      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      5214      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1632      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      5214      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
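Solution
- Find the PIDs of the processes that have GPU 3 open:
fuser -v /dev/nvidia*
The output does not show full process details (only a short command name for each PID), so I could not tell which PIDs were mine: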
/dev/nvidia3:        biiteam    5374 F...m gnome-shell
                     biiteam    5447 F...m TeamViewer
                     biiteam 1726146 F...m python3.8
                     biiteam 1726151 F...m python3.8
                     biiteam 1726154 F...m python3.8
                     biiteam 1726157 F...m python3.8
                     biiteam 1726160 F...m python3.8
                     biiteam 1726163 F...m python3.8
                     biiteam 1726166 F...m python3.8
                     biiteam 1726169 F...m python3.8
                     biiteam 1726172 F...m python3.8
                     biiteam 1726175 F...m python3.8
                     biiteam 1816342 F...m firefox
                     biiteam 3032698 F.... python
                     biiteam 3032699 F.... python
                     biiteam 3032700 F.... python
                     biiteam 3032701 F.... python
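Because the wildcard covers every NVIDIA device node, it can help to query only the node for the GPU in question and pass the PIDs directly to ps. The following is a minimal sketch, assuming psmisc's fuser (which writes only the PIDs to stdout) and a procps-style ps; the helper name pids_to_csv is hypothetical and not part of the original workflow.

```shell
#!/usr/bin/env bash
# Sketch: convert the whitespace-separated PID list that fuser writes to
# stdout into the comma-separated form accepted by `ps -p`.
# pids_to_csv is a hypothetical helper name chosen for this example.
pids_to_csv() {
    # Turn every run of whitespace into one comma, then trim the
    # leading and trailing commas left over from padding and newlines.
    tr -s '[:space:]' ',' | sed -e 's/^,//' -e 's/,$//'
}

# Usage (not executed here; needs a machine that has /dev/nvidia3):
#   pids=$(fuser /dev/nvidia3 2>/dev/null | pids_to_csv)
#   ps -o pid,user,lstart,args -p "$pids"
```

The lstart column shows when each process started, which makes leftovers from an interrupted session easier to spot.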
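- Show the details of the PIDs listed above:
ps -p 5374 5447 1726146 1726151 1726154 1726157 1726160 1726163 1726166 1726169 1726172 1726175 1816342 3032698 3032699 3032700 3032701
In the output, the PIDs starting with 1726 belong to my own experiment: they are the debugpy processes launched from the VS Code server.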
    PID TTY      STAT    TIME COMMAND
   5374 ?        Ssl   116:08 /usr/bin/gnome-shell
   5447 ?        Sl    272:42 /opt/teamviewer/tv_bin/TeamViewer
1726146 ?        Sl     21:53 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726151 ?        Sl     33:24 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726154 ?        Sl     33:26 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726157 ?        Sl     33:41 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726160 ?        Sl     33:16 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726163 ?        Sl     33:25 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726166 ?        Sl     33:32 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726169 ?        Sl     33:30 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726172 ?        Sl     34:36 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1726175 ?        Sl     33:21 /home/biiteam/anaconda3/envs/zj/bin/python3.8 /home/biiteam/.vscode-server-insiders/extensions/ms-python.python-2024.0.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy
1816342 ?        Sl   1836:07 /usr/lib/firefox/firefox -new-window
3032698 pts/3    Sl+  3849:52 /home/biiteam/anaconda3/envs/nj/bin/python -u basicsr/train.py --local_rank=0 -opt Deraining/Options_resvmamba/Deraining_ResVMambaXt_m_4gpu_2x.yml --launcher pytorch
3032699 pts/3    Sl+  3849:11 /home/biiteam/anaconda3/envs/nj/bin/python -u basicsr/train.py --local_rank=1 -opt Deraining/Options_resvmamba/Deraining_ResVMambaXt_m_4gpu_2x.yml --launcher pytorch
3032700 pts/3    Sl+  3849:08 /home/biiteam/anaconda3/envs/nj/bin/python -u basicsr/train.py --local_rank=2 -opt Deraining/Options_resvmamba/Deraining_ResVMambaXt_m_4gpu_2x.yml --launcher pytorch
3032701 pts/3    Sl+  3849:59 /home/biiteam/anaconda3/envs/nj/bin/python -u basicsr/train.py --local_rank=3 -opt Deraining/Options_resvmamba/Deraining_ResVMambaXt_m_4gpu_2x.yml --launcher pytorch
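Picking the right PIDs by eye gets error-prone once the list is this long, and other people's training jobs sit on the same GPU. As a rough sketch, the selection can be done by matching on the command line instead. The helper name pids_matching is hypothetical, and the choice of debugpy as the marker string is an assumption based on the output above.

```shell
#!/usr/bin/env bash
# Sketch: read "PID COMMAND..." lines on stdin and print only the PIDs whose
# command line contains the given literal substring.
# pids_matching is a hypothetical helper name chosen for this example.
pids_matching() {
    awk -v pat="$1" '{
        pid = $1
        $1 = ""                       # drop the PID so only the command is searched
        if (index($0, pat) > 0) print pid
    }'
}

# Usage (not executed here): list the debugpy leftovers among the GPU 3 users.
#   ps -o pid=,args= -p "$pids" | pids_matching debugpy
```

Here pid= and args= suppress the header row, so every input line begins with a PID. It is still worth reading the resulting list before killing anything.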
- Kill all of the PIDs identified as mine (the ones starting with 1726):
kill -9 1726146 1726151 1726154 1726157 1726160 1726163 1726166 1726169 1726172 1726175
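kill -9 sends SIGKILL, which gives the processes no chance to clean up. A gentler variant, sketched below, sends SIGTERM first and only falls back to SIGKILL for processes that are still alive after a grace period. The function name terminate_then_kill and the grace period are assumptions made for illustration; this is not what the original steps did.

```shell
#!/usr/bin/env bash
# Sketch: ask processes to exit with SIGTERM, wait up to GRACE seconds,
# then send SIGKILL to whatever is still running.
# terminate_then_kill is a hypothetical helper name chosen for this example.
terminate_then_kill() {
    local grace=$1 waited=0 alive pid
    shift
    kill -TERM "$@" 2>/dev/null || true     # polite request first
    while [ "$waited" -lt "$grace" ]; do
        alive=0
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether the PID exists.
            if kill -0 "$pid" 2>/dev/null; then
                alive=1
            fi
        done
        if [ "$alive" -eq 0 ]; then
            return 0                        # everything exited on its own
        fi
        sleep 1
        waited=$((waited + 1))
    done
    kill -KILL "$@" 2>/dev/null || true     # force the stragglers
    return 0
}

# Usage (not executed here):
#   terminate_then_kill 10 1726146 1726151 1726154
```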
- Run nvidia-smi again to check the result:
nvidia-smi
GPU 3 is back to 10MiB, so the stray memory usage is gone:
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
| 31%   37C    P8    26W / 350W |     10MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
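The final check can also be scripted using nvidia-smi's CSV query output instead of reading the table by eye. The sketch below only parses that CSV. The helper name gpu_memory_below and the 100 MiB threshold are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: read "index, memory.used" rows on stdin, as printed by
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
# and succeed only if every listed GPU uses less than LIMIT MiB.
# gpu_memory_below is a hypothetical helper name chosen for this example.
gpu_memory_below() {
    awk -F', *' -v limit="$1" '
        $2 + 0 >= limit + 0 {
            printf "GPU %s still uses %s MiB\n", $1, $2
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (not executed here; needs the NVIDIA driver installed):
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits -i 3 \
#       | gpu_memory_below 100 && echo "GPU 3 is free"
```

The -i 3 option restricts the query to GPU 3, so activity on the other cards does not affect the result.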