Script-based detection
For a detailed breakdown of the nvidia-smi command's output, and for the difference between GPU (compute) utilization and GPU memory usage, see the reference article 《nvidia-smi查看GPU的使用信息并分析》.
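To see the two metrics side by side before diving into the scripts, here is a one-off query sketch; it uses the same standard nvidia-smi query fields the scripts below rely on:

import subprocess

# Compute utilization (%) vs. memory used/total (MiB), one line per GPU
out = subprocess.run(
    ['nvidia-smi',
     '--query-gpu=index,utilization.gpu,memory.used,memory.total',
     '--format=csv,noheader,nounits'],
    encoding='utf-8', stdout=subprocess.PIPE, check=True
).stdout
for line in out.strip().split('\n'):
    idx, util, used, total = [int(x) for x in line.split(',')]
    print(f"GPU {idx}: utilization {util}%, memory {used}/{total} MiB")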
GPU utilization below 5%
#!/bin/bash
# Loop until a GPU with utilization below 5% is found
while true; do
    # Query the utilization of every GPU with nvidia-smi
    # --query-gpu=utilization.gpu --format=csv,noheader,nounits returns one utilization value (in %) per line
    mapfile -t GPU_USAGES < <(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)

    # Check each GPU's utilization
    for usage in "${GPU_USAGES[@]}"; do
        if [[ $usage -lt 5 ]]; then
            echo "A GPU is idle with usage $usage%, running the script..."
            # Run the Python script
            python script.py
            exit 0  # Exit after one run if the script should only run once on any idle GPU
        fi
    done

    echo "All GPUs are busy, usage above 5%. Checking again in 60 seconds..."
    # No idle GPU; wait 60 seconds and check again
    sleep 60
done
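One caveat with this check: utilization.gpu reflects only a short sampling window, so a reading below 5% does not guarantee the card is actually free; a process can hold most of the memory while momentarily running no kernels. The next script therefore checks memory usage instead.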
GPU memory usage
#!/bin/bash
# Threshold for memory usage, set to 5% here
threshold=5

# Loop until a GPU with memory usage below the threshold is found
while true; do
    echo "Current time: $(date '+%Y-%m-%d %H:%M:%S')"

    # Query total and used memory of every GPU with nvidia-smi
    # --query-gpu=memory.total,memory.used --format=csv,noheader,nounits returns "total, used" (in MiB) per line
    mapfile -t GPU_MEMORIES < <(nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits)

    # Check each GPU's memory usage
    for mem_info in "${GPU_MEMORIES[@]}"; do
        IFS=',' read -ra MEM <<< "$mem_info"
        total=${MEM[0]}
        used=${MEM[1]}
        # Compute the memory usage percentage
        usage=$(awk "BEGIN {printf \"%.2f\", ($used/$total)*100}")
        # Check whether memory usage is below the threshold
        if (( $(echo "$usage < $threshold" | bc -l) )); then
            echo "A GPU is idle with memory usage $usage%, running the script..."
            # Run the Python script
            python main.py
            exit 0  # Exit after one run if the script should only run once on any idle GPU
        fi
    done

    echo "All GPUs are busy, memory usage above $threshold%. Checking again in 60 seconds..."
    # No GPU with enough free memory; wait 60 seconds and check again
    sleep 60
done
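Both watcher scripts block in the foreground. To keep them waiting after you log out, run them under nohup or inside a tmux/screen session.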
Selecting a GPU in Python
Problem this solves: deep-learning frameworks default to CUDA device 0, so even when an idle GPU (say GPU 1) has been found, Python keeps allocating on the fully occupied GPU 0 and fails with:
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.69 GiB total capacity; 5.04 GiB already allocated; 2.94 MiB free; 5.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Checking GPU status
import subprocess
import os

def get_idle_gpu(min_memory_free=5000):
    """
    Return the index of the first idle GPU, or None if no GPU qualifies.
    min_memory_free is the minimum amount of free memory required, in MiB.
    """
    # Run nvidia-smi to get the current state of every GPU
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=index,memory.free', '--format=csv,nounits,noheader'],
            encoding='utf-8',
            stdout=subprocess.PIPE,
            check=True
        ).stdout
        # Parse the output and return the first GPU with more than min_memory_free MiB free
        for line in result.strip().split('\n'):
            gpu_index, memory_free = [int(x) for x in line.split(',')]
            if memory_free >= min_memory_free:
                return gpu_index
    except subprocess.CalledProcessError:
        print("Failed to run nvidia-smi")
    return None
Using the idle GPU
gpu_id = get_idle_gpu()
if gpu_id is not None:
    print(f"Using GPU: {gpu_id}")
    # Restrict visibility to the chosen GPU; this must be set before
    # the deep-learning framework initializes CUDA
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    # It is now safe to import and use your deep-learning framework
else:
    print("No idle GPU found. Consider waiting or using CPU.")