Monitoring GPU memory in real time and freeing GPU memory: a small study

colourmind

Category: Python Programming. Tags: python, pytorch

Article link: https://blog.csdn.net/HUSTHY/article/details/107733080

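This article shows how to use the pynvml library to monitor the memory, temperature, and power state of NVIDIA GPUs in real time, with code examples. It also shows how to free GPU memory from Python with torch, using torch.cuda.empty_cache() and the del statement.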

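When training a neural network or running inference, it is often useful to watch the state of GPU memory in real time. On NVIDIA cards, running the command watch -n 3 nvidia-smi in a terminal refreshes the GPU status every 3 seconds. NVIDIA also provides a Python library that makes it easy to read the same information from code.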
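The nvidia-smi command mentioned above can also be scripted when a plain terminal view is not enough. The following is a minimal sketch, not part of the original post: it assumes nvidia-smi is on PATH and uses its --query-gpu and --format=csv,noheader,nounits options, which report memory in MiB. The helper names parse_memory_csv and query_gpu_memory and the sample text are hypothetical, chosen only for illustration.

```python
import subprocess

# Query one CSV line per GPU: index, used memory, total memory (MiB, no units).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'index, used, total' lines into a list of dicts (values in MiB)."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed rows (needs the NVIDIA driver tools)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


# Hypothetical sample text in the same layout, so the parser can be tried without a GPU.
sample = "0, 1024, 24576\n1, 0, 24576"
rows = parse_memory_csv(sample)
for row in rows:
    print("GPU %d: %d / %d MiB used" % (row["index"], row["used_mib"], row["total_mib"]))
```

On a machine with an NVIDIA driver, calling query_gpu_memory() returns the same structure from live data. Fields that nvidia-smi reports as not available would need extra handling that this sketch leaves out.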

1. Basic usage of the pynvml library

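pynvml is the Python package that NVIDIA provides for querying the current state of its GPUs. The values of most interest are usually the real-time memory usage, the temperature, and the power state, and the library has an interface for each of them, which is very convenient. The main calls are: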

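pynvml.nvmlInit()  # initialize NVML
pynvml.nvmlDeviceGetCount()  # number of GPUs
pynvml.nvmlDeviceGetHandleByIndex(i)  # handle for GPU i
pynvml.nvmlDeviceGetName(handle)  # GPU name
memo_info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # memory info (all fields are in bytes)
memo_info.total  # total memory
memo_info.free  # free memory
memo_info.used  # memory in use
pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # GPU temperature in degrees C (the constant equals 0)
pynvml.nvmlDeviceGetFanSpeed(handle)  # fan speed as a percentage of the maximum
pynvml.nvmlDeviceGetPowerState(handle)  # power state (the P-state), not the power draw in watts
pynvml.nvmlShutdown()  # release NVML when done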
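NVML reports the total, free, and used fields in bytes, so it helps to make the unit conversion explicit. The sketch below is an illustration rather than part of the original post: the helper names bytes_to_gib and summarize_memory are hypothetical, and the sample numbers describe an imaginary 24 GiB card, not a measurement.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def bytes_to_gib(n_bytes):
    """Convert a byte count (as returned by NVML) to GiB."""
    return n_bytes / GIB


def summarize_memory(total, free, used):
    """Turn raw byte counts into GiB figures plus a used percentage."""
    return {
        "total_gib": bytes_to_gib(total),
        "free_gib": bytes_to_gib(free),
        "used_gib": bytes_to_gib(used),
        "used_pct": 100.0 * used / total,
    }


# Hypothetical values for an imaginary 24 GiB card with 6 GiB in use.
summary = summarize_memory(total=24 * GIB, free=18 * GIB, used=6 * GIB)
print("%.2f GiB used of %.2f GiB (%.1f%%)"
      % (summary["used_gib"], summary["total_gib"], summary["used_pct"]))
```

With a real handle, the three arguments would come from the total, free, and used fields of pynvml.nvmlDeviceGetMemoryInfo(handle).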

import pynvml

pynvml.nvmlInit()  # initialize NVML
# Device overview
deviceCount = pynvml.nvmlDeviceGetCount()
print('Number of GPUs:', deviceCount)
for i in range(deviceCount):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    gpu_name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(gpu_name, bytes):  # some pynvml versions return bytes
        gpu_name = gpu_name.decode()
    print('GPU %d is: %s' % (i, gpu_name))

    # Memory info (NVML reports bytes; 1 GiB = 1024**3 bytes)
    memo_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print("GPU %d Memory Total: %.4f GiB" % (i, memo_info.total / 1024**3))
    print("GPU %d Memory Free: %.4f GiB" % (i, memo_info.free / 1024**3))
    print("GPU %d Memory Used: %.4f GiB" % (i, memo_info.used / 1024**3))

    # Temperature of the GPU die, in degrees C
    temperature = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print("Temperature is %d C" % temperature)

    # Fan speed (percentage of maximum); cards without a fan may raise NVMLError
    try:
        speed = pynvml.nvmlDeviceGetFanSpeed(handle)
        print("Fan speed is %d%%" % speed)
    except pynvml.NVMLError as err:
        print("Fan speed not available:", err)

    # Power state (P-state)
    power_status = pynvml.nvmlDeviceGetPowerState(handle)
    print("Power status", power_status)
# Shut down NVML
pynvml.nvmlShutdown()

The result is as follows: [figure: screenshot of the script's output]
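To get the continuous view that watch -n 3 nvidia-smi gives, the one-shot query can be wrapped in a polling loop. This is a minimal sketch under stated assumptions, not the author's code: the names read_free_gib and monitor are hypothetical, pynvml is imported lazily so the module loads on a machine without a GPU, and the reader function is passed in so the loop can be tried with fake values.

```python
import time


def read_free_gib(index=0):
    """Read the free memory of GPU `index` in GiB (needs pynvml and an NVIDIA driver)."""
    import pynvml  # imported lazily so this module loads without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).free / 1024 ** 3
    finally:
        pynvml.nvmlShutdown()


def monitor(read_fn, interval=3.0, samples=5, sleep=time.sleep):
    """Call read_fn `samples` times, `interval` seconds apart, and return the readings."""
    readings = []
    for i in range(samples):
        value = read_fn()
        readings.append(value)
        print("sample %d: %.4f GiB free" % (i, value))
        if i < samples - 1:  # no need to wait after the last sample
            sleep(interval)
    return readings
```

On a machine with a GPU, monitor(read_free_gib, interval=3, samples=10) prints ten readings three seconds apart. Passing any other zero-argument function as read_fn makes it possible to exercise the loop without hardware.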

2. Freeing GPU memory

......
other code
......
del model  # drop the Python reference to the model
torch.cuda.empty_cache()  # release PyTorch's cached, unused GPU memory

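Sometimes GPU memory has to be freed while the program is still running. The del model plus torch.cuda.empty_cache() pattern at the start of this section does that. The complete code is as follows: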

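import torch
import pynvml
from transformers import BertModel


def get_gpu_memory(handle):
    """Return the free memory of the GPU behind `handle`, in GiB."""
    meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
    free = meminfo.free / 1024**3  # NVML reports bytes
    return free


if __name__ == "__main__":
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print('Initial free memory: %.4f GiB' % get_gpu_memory(handle))
    model = BertModel.from_pretrained('./output/training_patent_sbert-Chinese-BERT-wwm2019-10-09_10-42-20_with_20K_Trains/0_BERT/')

    device = torch.device('cuda:0')
    model.to(device)
    print('Free memory after loading the BERT model: %.4f GiB' % get_gpu_memory(handle))

    # About 21.5 GiB of float32 data; shrink the shape on a smaller GPU
    dummy_tensor_4 = torch.randn(370, 60, 510, 510).float().to(device)
    print('Free memory after moving the data to the GPU: %.4f GiB' % get_gpu_memory(handle))

    # Then release it: the GPU copy is dropped, but PyTorch keeps the blocks cached
    dummy_tensor_4 = dummy_tensor_4.cpu()
    print('Free memory after moving the data back to the CPU: %.4f GiB' % get_gpu_memory(handle))

    torch.cuda.empty_cache()
    print('Free memory after torch.cuda.empty_cache(): %.4f GiB' % get_gpu_memory(handle))
    del model  # del is a statement; the freed blocks stay in PyTorch's cache
    print('Free memory after del model: %.4f GiB' % get_gpu_memory(handle))
    torch.cuda.empty_cache()  # return the model's cached blocks to the driver
    print('Free memory after del model and empty_cache(): %.4f GiB' % get_gpu_memory(handle))
    pynvml.nvmlShutdown()  # finally, shut down NVML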

The result is as follows: [figure: screenshot of the script's output]
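The order of del and torch.cuda.empty_cache() matters because PyTorch uses a caching allocator: deleting a tensor or model makes its memory reusable inside PyTorch, but NVML and nvidia-smi keep counting it as used until the cache is emptied. The sketch below wraps that cleanup in one call. It is an illustration with a hypothetical helper name, release_cached_gpu_memory, and it assumes nothing else still references the objects to be freed. torch is imported lazily so the function degrades to a no-op where PyTorch or CUDA is missing.

```python
import gc


def release_cached_gpu_memory():
    """Collect garbage, then ask PyTorch to release its unused cached GPU blocks.

    Returns (reserved_before, reserved_after) in bytes, or None when PyTorch
    or a CUDA device is not available.
    """
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    before = torch.cuda.memory_reserved()  # bytes held by the caching allocator
    torch.cuda.empty_cache()               # hand unused cached blocks back to the driver
    after = torch.cuda.memory_reserved()
    return before, after


result = release_cached_gpu_memory()
if result is None:
    print("PyTorch with CUDA is not available; nothing to release")
else:
    print("reserved bytes: %d -> %d" % result)
```

A typical use is del model followed by release_cached_gpu_memory(). Memory that live tensors still occupy is reported by torch.cuda.memory_allocated() and is not affected by emptying the cache.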