NVIDIA-SMI shows ERR! for fan and power usage

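This article covers fixes for GPU overheating: moving the GPU to a cooler position in the workstation, setting a power limit so that peak temperature stays at or below 75 °C, and using nvidia-smi to set persistence mode and the power limit. It also covers the GPU reset option, which handles double-bit ECC errors and other situations where a machine reboot should be avoided. A reset is not guaranteed to succeed, so verify the GPU's health afterwards.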
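To check whether a card is actually exceeding the 75 °C target, the temperature can be read in CSV form and filtered. The following is a minimal sketch, not the author's method: `flag_hot_gpus` is a hypothetical helper name, and the sample readings fed to it are made up for illustration.

```shell
#!/bin/sh
# flag_hot_gpus [LIMIT_C]
# Reads "index, temperature" CSV lines on stdin and reports every GPU
# whose temperature is above LIMIT_C (default 75, the target in the text).
flag_hot_gpus() {
    limit=${1:-75}
    awk -F', *' -v limit="$limit" '
        # Anything that is not a plain number is reported as-is.
        $2 !~ /^[0-9]+$/ { print "GPU " $1 " reports a non-numeric temperature: " $2; next }
        $2 + 0 > limit + 0 { print "GPU " $1 " is at " $2 " C (above " limit " C)" }
    '
}

# Made-up sample input, in the shape produced by:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
printf '0, 68\n1, 81\n' | flag_hot_gpus 75
```

On a real machine, pipe the `nvidia-smi --query-gpu=...` command shown in the comment into `flag_hot_gpus` instead of the `printf` sample.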


Solution

Command reference

DEVICE MODIFICATION OPTIONS
 [any one of]
 -pm, --persistence-mode=MODE
 Set the persistence mode for the target GPUs. See the (GPU ATTRIBUTES)
 section for a description of persistence mode. Requires root. Will
 impact all GPUs unless a single GPU is specified using the -i argument.
 The effect of this operation is immediate. However, it does not
 persist across reboots. After each reboot persistence mode will default
 to "Disabled". Available on Linux only.
 In short: run this as root. It applies to every GPU unless -i selects a single one, takes effect immediately, and must be reapplied after each reboot because persistence mode defaults back to "Disabled". Linux only.
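The -pm option described above can be wrapped in a small helper that validates the mode before anything is run. This is a sketch under assumptions: `persistence_cmd` is a hypothetical name, and it only prints the command (a dry run) so it is safe to try on a machine without a GPU.

```shell
#!/bin/sh
# persistence_cmd MODE [GPU_INDEX]
# Prints the nvidia-smi command that sets persistence mode.
# MODE is 1 (enable) or 0 (disable); GPU_INDEX is optional.
persistence_cmd() {
    mode=$1
    gpu=${2:-}
    case "$mode" in
        0|1) ;;
        *) echo "mode must be 0 or 1" >&2; return 1 ;;
    esac
    if [ -n "$gpu" ]; then
        echo "nvidia-smi -i $gpu -pm $mode"
    else
        # Without -i the setting applies to every GPU in the system.
        echo "nvidia-smi -pm $mode"
    fi
}

# Dry run: print the command for GPU 0 instead of executing it.
persistence_cmd 1 0
```

To apply the setting, run the printed command as root, for example `sudo nvidia-smi -i 0 -pm 1`. Because the setting is lost on reboot, it has to be reissued after every boot.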
 
-pl, --power-limit=POWER_LIMIT
 Specifies maximum power limit in watts. Accepts integer and floating
 point numbers. Only on supported devices from Kepler family. Requires
 administrator privileges. Value needs to be between Min and Max Power
 Limit as reported by nvidia-smi.
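 In short: the limit is given in watts (integer or floating point), requires administrator privileges, and must lie between the Min and Max Power Limit reported by nvidia-smi. Only supported devices from the Kepler family accept it.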
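Since the value passed to -pl must fall between the reported minimum and maximum, it helps to check the range before issuing the command. A minimal sketch, assuming a hypothetical helper `power_limit_cmd`; the wattages in the example call are illustrative and not taken from any particular card.

```shell
#!/bin/sh
# power_limit_cmd GPU_INDEX WATTS MIN_WATTS MAX_WATTS
# Prints the nvidia-smi command that sets the power limit, but only if
# WATTS lies inside [MIN_WATTS, MAX_WATTS].
power_limit_cmd() {
    gpu=$1; watts=$2; min=$3; max=$4
    # awk does the floating-point comparison that plain sh cannot.
    if awk -v w="$watts" -v lo="$min" -v hi="$max" \
        'BEGIN { exit !(w + 0 >= lo + 0 && w + 0 <= hi + 0) }'; then
        echo "nvidia-smi -i $gpu -pl $watts"
    else
        echo "power limit $watts W is outside [$min, $max] W" >&2
        return 1
    fi
}

# The real bounds for GPU 0 can be read with:
#   nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit --format=csv,noheader,nounits
# Dry run with illustrative bounds of 100 W and 250 W:
power_limit_cmd 0 200 100 250
```

Run the printed command with administrator privileges to apply it. Lowering the limit in steps while watching the temperature is one way to find a value that keeps the card under the 75 °C target.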

-r, --gpu-reset
 Trigger a reset of the GPU. Can be used to clear GPU HW and SW state
 in situations that would otherwise require a machine reboot. Typically
 useful if a double bit ECC error has occurred. Requires -i switch to
 target specific device. Requires root. There can't be any
 applications using this particular device (e.g. CUDA application, graphics
 application like X server, monitoring application like other instance
 of nvidia-smi). There also can't be any compute applications running
 on any other GPU in the system. Only on supported devices from Fermi
 and Kepler family running on Linux.
 GPU reset is not guaranteed to work in all cases. It is not recommended
 for production environments at this time. In some situations there may
 be HW components on the board that fail to revert back to an initial
 state following the reset request. This is more likely to be seen on
 Fermi-generation products vs. Kepler, and more likely to be seen if the
 reset is being performed on a hung GPU.
 Following a reset, it is recommended that the health of the GPU be
 verified before further use. The nvidia-healthmon tool is a good choice
 for this test. If the GPU is not healthy a complete reset should be
 instigated by power cycling the node.
 Visit http://developer.nvidia.com/gpu-deployment-kit to download the
 GDK and nvidia-healthmon.
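The preconditions for -r listed above (an explicit -i target and no applications on the device) can be checked before the reset is attempted. This is a sketch, not a definitive procedure: `reset_cmd` is a hypothetical helper, it only prints the command, and the PID list it receives covers compute applications only, so an X server or another nvidia-smi instance has to be ruled out separately.

```shell
#!/bin/sh
# reset_cmd GPU_INDEX [PIDS]
# Prints the nvidia-smi reset command for one GPU, refusing when no
# index is given or when PIDS (processes still using the GPU) is non-empty.
reset_cmd() {
    gpu=${1:-}
    pids=${2:-}
    if [ -z "$gpu" ]; then
        echo "a GPU index is required (-r needs -i)" >&2
        return 1
    fi
    if [ -n "$pids" ]; then
        echo "GPU $gpu is still in use by PID(s): $pids" >&2
        return 1
    fi
    echo "nvidia-smi -r -i $gpu"
}

# On a real machine the PID list for GPU 0 could come from:
#   nvidia-smi -i 0 --query-compute-apps=pid --format=csv,noheader
# Dry run for an idle GPU 0 (empty PID list):
reset_cmd 0 ""
```

Run the printed command as root, then verify the GPU's health before putting it back into use, as the man page recommends.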
  • https://developer.nvidia.com/nvidia-system-management-interface
  • https://developer.download.nvidia.cn/compute/DCGM/docs/nvidia-smi-367.38.pdf

CG

  • Other related commands: nvtop, nvitop