Problem description: When using RecBole, the first few GPUs on the server were already occupied, so I wanted to set gpu_id to 4. I tried specifying it via the command line, a YAML file, and the config dict, but every attempt failed with the same error:
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 23.70 GiB total capacity; 1.87 GiB already allocated; 49.56 MiB free; 1.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The error says GPU 0 is out of memory, yet the parameter list printed by the program shows gpu_id=4. In other words, the program does not actually honor the specified GPU, which may be a bug.
Solution: when launching from the command line, prepend CUDA_VISIBLE_DEVICES=4 to python run.py:
CUDA_VISIBLE_DEVICES=4 python run.py
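If you would rather not change the launch command, an equivalent workaround is to set the environment variable inside the script itself. This is a minimal sketch; the key point is that the assignment must happen before torch (and therefore recbole) is imported, because CUDA reads the variable at initialization time:

```python
import os

# Expose only physical GPU 4 to this process. Must run before the first
# `import torch`; afterwards PyTorch sees that GPU as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "4"

# Imports that initialize CUDA come only after the variable is set,
# e.g.: import torch; from recbole.quick_start import run_recbole
```

With the variable set this way, the masked GPU appears to the process as device 0, so the default gpu_id works without further changes.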