Configuration
- Ubuntu 20.04
- RTX 3080 Ti
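Problem description
When running ML models in an Anaconda environment, after a while the GPU stops outputting video (the monitor reports no signal), or the display resolution jumps to an abnormally large value. Everything returns to normal after a reboot.
Checking /var/log/syslog shows that the GPU has fallen off the bus: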
kernel: [ 1527.515650] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
kernel: [ 1527.515652] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
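To find these entries without scrolling through the whole log, a small filter along the following lines may help. This is only a sketch: the function name `find_xid_events` is made up for illustration, and it assumes the kernel messages end up in /var/log/syslog as they do on this machine.

```shell
# find_xid_events: hypothetical helper, not part of any NVIDIA tool.
# Prints every NVRM Xid line from the given log file; defaults to
# /var/log/syslog, where the messages above were found.
find_xid_events() {
    local log_file="${1:-/var/log/syslog}"
    # -F matches the pattern as a fixed string. "|| true" keeps the
    # function from returning an error when nothing matches.
    grep -F 'NVRM: Xid' "$log_file" || true
}

# Example: count the Xid 79 "fallen off the bus" events in the default log.
#   find_xid_events | grep -c 'GPU has fallen off the bus'
```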
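Root cause analysis
Nothing unusual happens during normal use, and the problem only appears after a model has been running for a while.
The system monitor and GPU monitoring show:
- GPU memory usage is stable
- GPU temperature is stable at around 70 °C
- System memory usage is stable
Stable memory and temperature make memory exhaustion and overheating unlikely, so the most probable causes are an insufficient power supply or a GPU clock frequency that is too high.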
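Since the GPU drops out without warning, it can help to keep a record of the readings leading up to the failure, including power draw, which the list above does not cover. A minimal sketch, assuming a made-up function name `log_gpu_stats`, a hypothetical output file `gpu_log.csv`, and a 5-second sampling interval:

```shell
# log_gpu_stats: hypothetical wrapper around nvidia-smi's CSV query mode.
# Appends one CSV row per interval until interrupted (Ctrl+C) or until
# nvidia-smi itself exits.
log_gpu_stats() {
    local out_file="${1:-gpu_log.csv}"   # assumed output path
    local interval="${2:-5}"             # assumed interval in seconds
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found" >&2
        return 1
    fi
    nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm,memory.used \
        --format=csv \
        -l "$interval" >> "$out_file"
}

# Example: start logging in the background before launching training.
#   log_gpu_stats ~/gpu_log.csv 5 &
```

The last rows written before the crash show what the card was doing right before it fell off the bus.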
Solution
1. Check the GPU's current and maximum clock frequencies with:
nvidia-smi -q -d clock
2. Lock the GPU clock to a min,max range (in MHz):
sudo nvidia-smi -lgc 300,1600
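GPU status with the clock capped at 1700 MHz (the nvidia-smi summary, followed by the nvidia-smi -q -d clock report):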
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  | 00000000:01:00.0  On |                  N/A |
| 85%   68C    P2             223W / 350W |  11082MiB / 12288MiB |     70%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1407      G   /usr/lib/xorg/Xorg                          118MiB |
|    0   N/A  N/A      1546      G   /usr/bin/gnome-shell                         94MiB |
|    0   N/A  N/A      3075      G   ...03,262144 --variations-seed-version       90MiB |
|    0   N/A  N/A      3906      G   ...erProcess --variations-seed-version       55MiB |
|    0   N/A  N/A      5889      C   ...naconda3/envs/convnextv2/bin/python     5344MiB |
|    0   N/A  N/A      6902      C   ...naconda3/envs/convnextv2/bin/python     5356MiB |
+---------------------------------------------------------------------------------------+
GPU 00000000:01:00.0
    Clocks
        Graphics : 1695 MHz
        SM : 1695 MHz
        Memory : 9251 MHz
        Video : 1485 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 2100 MHz
        SM : 2100 MHz
        Memory : 9501 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    SM Clock Samples
        Duration : Not Found
        Number of Samples : Not Found
        Max : Not Found
        Min : Not Found
        Avg : Not Found
    Memory Clock Samples
        Duration : Not Found
        Number of Samples : Not Found
        Max : Not Found
        Min : Not Found
        Avg : Not Found
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
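To confirm that the cap is actually holding under load, the current graphics clock in this report can be compared against the configured maximum. The sketch below reads a clock report on stdin. The function name `check_clock_cap` is hypothetical, and the parsing assumes the layout shown above, where the first `Graphics` line is the current clock.

```shell
# check_clock_cap: hypothetical helper. Reads a clock report on stdin and
# succeeds if the current graphics clock is at or below the cap (in MHz).
check_clock_cap() {
    local cap_mhz="$1"
    local current
    # The first "Graphics" line belongs to the "Clocks" section (current
    # clock); later ones are application or max clocks, so stop after it.
    current=$(awk '$1 == "Graphics" { print $(NF-1); exit }')
    if [ -z "$current" ]; then
        echo "no Graphics clock found in input" >&2
        return 2
    fi
    echo "current graphics clock: ${current} MHz (cap: ${cap_mhz} MHz)"
    [ "$current" -le "$cap_mhz" ]
}

# Usage on a live system:
#   nvidia-smi -q -d clock | check_clock_cap 1700
# To remove the lock again and return to default clocks:
#   sudo nvidia-smi -rgc
```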
3. Enable persistence mode so the driver stays loaded:
sudo nvidia-smi -pm 1
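4. The settings from steps 2 and 3 are lost on every reboot. To reapply them automatically, add both commands to the startup script /etc/rc.local.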
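A possible way to set this up is sketched below. The helper name `write_rc_local` is made up, and the clock range simply repeats the values from step 2. It also assumes that the system still runs /etc/rc.local at boot when the file exists and is executable, which I believe is the case on Ubuntu 20.04 through systemd's rc-local compatibility unit, but this is worth verifying on your own machine.

```shell
# write_rc_local: hypothetical helper that writes a startup script to the
# given path (default /etc/rc.local, which needs root to write).
# Caution: this overwrites any existing file at that path.
write_rc_local() {
    local target="${1:-/etc/rc.local}"
    cat > "$target" <<'EOF'
#!/bin/sh
# Runs as root at boot, so sudo is not needed here.
nvidia-smi -pm 1          # step 3: enable persistence mode
nvidia-smi -lgc 300,1600  # step 2: lock the graphics clock range (MHz)
exit 0
EOF
    chmod +x "$target"
}

# Example (as root):
#   write_rc_local /etc/rc.local
```

After the next reboot, `nvidia-smi -q -d clock` should show whether the limit was applied.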
No problems have occurred since applying these settings.