NVIDIA-SMI shows ERR! ERR! ERR! for fan and power usage

FakeOccupational

Last modified 2024-01-11 15:40:08


Category: Other. Tags: linux, server, java

First published 2023-05-02 17:00:00

Article link: https://blog.csdn.net/ResumeProject/article/details/130341774


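This article offers fixes for GPU overheating: move the GPU to a cooler slot in the workstation, set a power limit so that the peak temperature stays at or below 75 °C, and use nvidia-smi to adjust persistence mode and the power limit. It also covers the GPU reset option, which is used after a double-bit ECC error or in other cases where a machine reboot should be avoided, with the caveat that a GPU reset does not always succeed and that the health of the GPU should be verified afterwards.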


Solution

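This problem is caused by the GPU running too hot. First, reseat the affected card in the coolest slot of the workstation. Second, set a power limit and fan speed so that the peak temperature stays at or below 75 °C. With these steps I recovered two 1080 Ti cards that had the same problem.
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 150
The second command caps the power limit at 150 W. Any value from 150 W to 200 W works (for example, -pl 200).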
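
To check whether the 75 °C target is actually being met, the temperature can be read with nvidia-smi's CSV query output and filtered. The sketch below is only an illustration: hot_gpus is a hypothetical helper name, and it assumes the input has the form "index, temperature" as produced by the query shown in the comment. Readings that are not plain numbers are skipped, so a card that reports an error instead of a temperature still has to be checked by hand.

```shell
# Sketch: list the indices of GPUs that are hotter than a given limit.
# hot_gpus is a hypothetical helper, not part of nvidia-smi.
# Expected input on stdin, one GPU per line: "<index>, <temperature>"
hot_gpus() {
    # $1: temperature limit in degrees Celsius
    # Non-numeric readings (for example "[N/A]") are ignored.
    awk -F', *' -v limit="$1" \
        '$2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 { print $1 }'
}

# Example use on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,temperature.gpu \
#              --format=csv,noheader,nounits | hot_gpus 75
```

If the helper prints any index, that card is still above the target and a lower power limit may be worth trying.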

Command reference

DEVICE MODIFICATION OPTIONS
 [any one of]
 -pm, --persistence-mode=MODE
 Set the persistence mode for the target GPUs. See the (GPU ATTRIBUTES)
 section for a description of persistence mode. Requires root. Will
 impact all GPUs unless a single GPU is specified using the -i argument.
 The effect of this operation is immediate. However, it does not persist
 across reboots. After each reboot persistence mode will default
 to "Disabled". Available on Linux only.
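
Because persistence mode falls back to "Disabled" after every reboot, it has to be re-applied, for instance from a boot-time script. The following is a minimal sketch under that assumption. reapply_commands is a hypothetical helper that reads "index, mode" lines (the form produced by the query in the comment) and prints the command needed for each GPU whose mode is Disabled; it does not run anything itself.

```shell
# Sketch: print the commands needed to re-enable persistence mode.
# reapply_commands is a hypothetical helper, not part of nvidia-smi.
# Expected input on stdin, one GPU per line: "<index>, Enabled|Disabled"
reapply_commands() {
    awk -F', *' '$2 == "Disabled" { printf "nvidia-smi -i %s -pm 1\n", $1 }'
}

# Example use (root is required to actually run the printed commands):
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
#       | reapply_commands
```

Printing the commands instead of executing them keeps the sketch safe to try; piping the output to a root shell would apply them.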
 
-pl, --power-limit=POWER_LIMIT
 Specifies maximum power limit in watts. Accepts integer and floating
 point numbers. Only on supported devices from Kepler family. Requires
 administrator privileges. Value needs to be between Min and Max Power
 Limit as reported by nvidia-smi.
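
Since the value passed to -pl must lie between the minimum and maximum power limit that nvidia-smi reports, it can help to clamp the requested value first. This is a sketch only: clamp_power_limit is a hypothetical helper, and the limits in the usage comment are placeholders, not figures for any particular card.

```shell
# Sketch: clamp a requested power limit into the range the card allows.
# clamp_power_limit is a hypothetical helper, not part of nvidia-smi.
clamp_power_limit() {
    # $1: requested limit in watts, $2: minimum limit, $3: maximum limit
    awk -v w="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (w < lo) w = lo
        if (w > hi) w = hi
        printf "%.2f\n", w
    }'
}

# The real bounds can be read per GPU, for example:
#   nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit \
#              --format=csv,noheader,nounits
# With placeholder bounds MIN and MAX taken from that output:
#   sudo nvidia-smi -i 0 -pl "$(clamp_power_limit 150 "$MIN" "$MAX")"
```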

-r, --gpu-reset
 Trigger a reset of the GPU. Can be used to clear GPU HW and SW state
 in situations that would otherwise require a machine reboot. Typically
 useful if a double bit ECC error has occurred. Requires -i switch to
 target specific device. Requires root. There can't be any applications
 using this particular device (e.g. CUDA application, graphics
 application like X server, monitoring application like other instance
 of nvidia-smi). There also can't be any compute applications running
 on any other GPU in the system. Only on supported devices from Fermi
 and Kepler family running on Linux.
 GPU reset is not guaranteed to work in all cases. It is not recommended
 for production environments at this time. In some situations there may
 be HW components on the board that fail to revert back to an initial
 state following the reset request. This is more likely to be seen on
 Fermi-generation products vs. Kepler, and more likely to be seen if the
 reset is being performed on a hung GPU.
 Following a reset, it is recommended that the health of the GPU be
 verified before further use. The nvidia-healthmon tool is a good choice
 for this test. If the GPU is not healthy a complete reset should be
 instigated by power cycling the node.
 Visit http://developer.nvidia.com/gpu-deployment-kit to download the
 GDK and nvidia-healthmon.
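
The reset conditions above can be partly checked before calling --gpu-reset. The sketch below covers only one of them, namely that no compute applications are running on any GPU; it does not detect an X server or another nvidia-smi instance, which still need a manual check. can_reset is a hypothetical helper that reads a list of PIDs and succeeds only when the list is empty.

```shell
# Sketch: succeed only if no compute process PIDs are listed on stdin.
# can_reset is a hypothetical helper, not part of nvidia-smi.
can_reset() {
    # grep -q succeeds if any line contains a digit, i.e. a PID is present;
    # the leading "!" inverts that, so an empty list means "safe to try".
    ! grep -q '[0-9]'
}

# Example use; without -i the query covers compute apps on all GPUs:
#   if nvidia-smi --query-compute-apps=pid --format=csv,noheader | can_reset
#   then
#       sudo nvidia-smi --gpu-reset -i 0
#   else
#       echo "compute processes are still running; not resetting" >&2
#   fi
```

Even when the check passes, the reset itself may fail, so the health of the GPU should still be verified afterwards as the man page recommends.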