Problem
After setting up the environment, running the model fails with: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "…/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
Solution
Edit __init__.py in /usr/local/lib/python3.8/dist-packages/torch/cuda (for example, with vim __init__.py). The device_count() function should read:
_cached_device_count: Optional[int] = None

def device_count() -> int:
    r"""Return the number of GPUs available."""
    global _cached_device_count
    if not _is_compiled():
        return 0
    if _cached_device_count is not None:
        return _cached_device_count
    # bypass _device_count_nvml() if rocm (not supported)
    nvml_count = -1 if torch.version.hip else _device_count_nvml()
    r = torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count
    # NB: Do not cache the device count prior to CUDA initialization, because
    # the number of devices can change due to changes to CUDA_VISIBLE_DEVICES
    # setting prior to CUDA initialization.
    if _cached_device_count is None and _initialized:
        _cached_device_count = r
    return r
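The caching rule in the NB comment can be illustrated with a small stand-alone toy. This is only a sketch of the pattern, not PyTorch code: the DeviceCounter class, its query callback, and the initialized flag are hypothetical stand-ins for the driver query and for torch.cuda's _initialized flag.

```python
from typing import Callable, Optional


class DeviceCounter:
    """Toy model of the caching logic in device_count(); not PyTorch code."""

    def __init__(self, query: Callable[[], int]) -> None:
        self._query = query  # hypothetical stand-in for the driver query
        self._cached: Optional[int] = None
        self.initialized = False  # stand-in for torch.cuda's _initialized

    def count(self) -> int:
        if self._cached is not None:
            return self._cached
        r = self._query()
        # Cache only after initialization; before that, the visible
        # device set can still change via CUDA_VISIBLE_DEVICES.
        if self.initialized:
            self._cached = r
        return r


visible = [4]  # pretend the driver currently reports 4 GPUs
counter = DeviceCounter(lambda: visible[0])

before = counter.count()    # queried, not cached yet
visible[0] = 1              # e.g. CUDA_VISIBLE_DEVICES narrowed to one GPU
after = counter.count()     # the change is still picked up

counter.initialized = True
counter.count()             # queried once more and now cached
visible[0] = 8
frozen = counter.count()    # later changes are no longer seen
```

In this toy, a count read before initialization always reflects the current setting, while a count read after initialization is frozen. This is why CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized.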
Specify the device when running the code:
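import os

# Set these before CUDA is initialized (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # expose only physical GPU 2
os.environ["WORLD_SIZE"] = "1"  # one process in the job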
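One point worth checking after this change: once CUDA_VISIBLE_DEVICES is "2", the process sees a single GPU, and that GPU is addressed as logical index 0, not 2. Asking for an index outside the visible range is the kind of request that the device >= 0 && device < num_gpus assertion rejects. The sketch below shows this index mapping using only the standard library. The helper names visible_gpu_ids and is_valid_logical_index are hypothetical, and the parsing is simplified (it ignores GPU UUIDs and how the CUDA runtime handles invalid entries).

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the physical GPU ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, meaning every GPU is visible.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def is_valid_logical_index(index: int, env: Optional[Mapping[str, str]] = None) -> bool:
    """Check a logical device index against the visible GPU list."""
    ids = visible_gpu_ids(env)
    if ids is None:
        # Unset variable: the real limit is the driver's GPU count,
        # which this sketch does not query.
        return index >= 0
    return 0 <= index < len(ids)


env = {"CUDA_VISIBLE_DEVICES": "2"}
ids = visible_gpu_ids(env)                    # one visible GPU: physical id "2"
ok_index_0 = is_valid_logical_index(0, env)   # logical index 0 maps to GPU 2
ok_index_2 = is_valid_logical_index(2, env)   # index 2 is out of range
```

In PyTorch terms, with the setting above the model should be placed on "cuda:0" (or simply "cuda"), since that is the only index available inside the process.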