Script-based detection
For a detailed breakdown of the nvidia-smi command's output, and for the difference between GPU (compute) utilization and GPU memory usage, see the reference article 《nvidia-smi查看GPU的使用信息并分析》.
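To see the two metrics side by side before diving into the scripts, here is a one-off query sketch; it uses the same standard nvidia-smi query fields the scripts below rely on:

import subprocess

# Compute utilization (%) vs. memory used/total (MiB), one line per GPU
out = subprocess.run(
    ['nvidia-smi',
     '--query-gpu=index,utilization.gpu,memory.used,memory.total',
     '--format=csv,noheader,nounits'],
    encoding='utf-8', stdout=subprocess.PIPE, check=True
).stdout
for line in out.strip().split('\n'):
    idx, util, used, total = [int(x) for x in line.split(',')]
    print(f"GPU {idx}: utilization {util}%, memory {used}/{total} MiB")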
GPU utilization below 5%
#!/bin/bash
# Loop until a GPU with utilization below 5% is found
while true; do
    # Query the utilization of every GPU with nvidia-smi
    # --query-gpu=utilization.gpu --format=csv,noheader,nounits returns one utilization value (in %) per line
    mapfile -t GPU_USAGES < <(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)

    # Check each GPU's utilization
    for usage in "${GPU_USAGES[@]}"; do
        if [[ $usage -lt 5 ]]; then
            echo "A GPU is idle with usage $usage%, running the script..."
            # Run the Python script
            python script.py
            exit 0  # Exit after one run if the script should only run once on any idle GPU
        fi
    done

    echo "All GPUs are busy, usage above 5%. Checking again in 60 seconds..."
    # No idle GPU; wait 60 seconds and check again
    sleep 60
done
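One caveat with this check: utilization.gpu reflects only a short sampling window, so a reading below 5% does not guarantee the card is actually free; a process can hold most of the memory while momentarily running no kernels. The next script therefore checks memory usage instead.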
GPU memory usage
#!/bin/bash
# Threshold for memory usage, set to 5% here
threshold=5

# Loop until a GPU with memory usage below the threshold is found
while true; do
    echo "Current time: $(date '+%Y-%m-%d %H:%M:%S')"

    # Query total and used memory of every GPU with nvidia-smi
    # --query-gpu=memory.total,memory.used --format=csv,noheader,nounits returns "total, used" (in MiB) per line
    mapfile -t GPU_MEMORIES < <(nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits)

    # Check each GPU's memory usage
    for mem_info in "${GPU_MEMORIES[@]}"; do
        IFS=',' read -ra MEM <<< "$mem_info"
        total=${MEM[0]}
        used=${MEM[1]}
        # Compute the memory usage percentage
        usage=$(awk "BEGIN {printf \"%.2f\", ($used/$total)*100}")
        # Check whether memory usage is below the threshold
        if (( $(echo "$usage < $threshold" | bc -l) )); then
            echo "A GPU is idle with memory usage $usage%, running the script..."
            # Run the Python script
            python main.py
            exit 0  # Exit after one run if the script should only run once on any idle GPU
        fi
    done

    echo "All GPUs are busy, memory usage above $threshold%. Checking again in 60 seconds..."
    # No GPU with enough free memory; wait 60 seconds and check again
    sleep 60
done
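Both watcher scripts block in the foreground. To keep them waiting after you log out, run them under nohup or inside a tmux/screen session.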
Selecting a GPU in Python
Problem this solves: deep-learning frameworks default to CUDA device 0, so even when an idle GPU (say GPU 1) has been found, Python keeps allocating on the fully occupied GPU 0 and fails with:
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.69 GiB total capacity; 5.04 GiB already allocated; 2.94 MiB free; 5.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Checking GPU status
import subprocess
import os

def get_idle_gpu(min_memory_free=5000):
    """
    Return the index of the first idle GPU, or None if no GPU qualifies.
    min_memory_free is the minimum amount of free memory required, in MiB.
    """
    # Run nvidia-smi to get the current state of every GPU
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=index,memory.free', '--format=csv,nounits,noheader'],
            encoding='utf-8',
            stdout=subprocess.PIPE,
            check=True
        ).stdout
        # Parse the output and return the first GPU with more than min_memory_free MiB free
        for line in result.strip().split('\n'):
            gpu_index, memory_free = [int(x) for x in line.split(',')]
            if memory_free >= min_memory_free:
                return gpu_index
    except subprocess.CalledProcessError:
        print("Failed to run nvidia-smi")
    return None
Using the idle GPU
gpu_id = get_idle_gpu()
if gpu_id is not None:
    print(f"Using GPU: {gpu_id}")
    # Restrict visibility to the chosen GPU; this must be set before
    # the deep-learning framework initializes CUDA
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    # It is now safe to import and use your deep-learning framework
else:
    print("No idle GPU found. Consider waiting or using CPU.")