Preface
My test environment:
Ubuntu 20.04, Python 3.8
1. Problem Description
Running a Python program produced the following error:
Traceback (most recent call last):
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/wong/Datum/workspace_demo/packagetest_backbone_vs_pooling_ws/src/packagetest_backbone_vs_pooling/backbone_vs_pooling/train.py", line 103, in train_test
    backbone.to(device)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "main.py", line 138, in <module>
    main(args)
  File "main.py", line 109, in main
    all_results = pool.starmap(train_test, params)
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/wong/ProgramFiles/anaconda3/envs/pytorch_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
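2. Solution
The error means that CUDA failed to initialize inside a worker process. This typically happens when the parent process has already initialized CUDA and the workers are then created with fork: the CUDA runtime does not support being used in a forked child process. The following fix worked in my own testing.
On Linux, the default multiprocessing start method is fork, which can make CUDA initialization fail in worker processes that use the GPU. Changing the start method to spawn avoids the problem, because each worker then starts as a fresh Python interpreter and initializes CUDA on its own.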
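Before changing anything, it can help to check which start method is actually in effect. The sketch below uses only the standard-library `multiprocessing` module; the helper name `describe_start_methods` is made up for illustration and is not part of the original program.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return the start method currently fixed (or None) and all supported ones."""
    # allow_none=True avoids fixing the platform default as a side effect of asking.
    current = mp.get_start_method(allow_none=True)
    # The first entry of this list is the platform default.
    available = mp.get_all_start_methods()
    return current, available


if __name__ == "__main__":
    current, available = describe_start_methods()
    print("current start method:", current)  # None means nothing has been set yet
    print("available start methods:", available)
```

On a Linux machine like the test environment described here, fork is expected to appear first in the list of available methods.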
Add the following to the main program:
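import torch.multiprocessing as mp

if __name__ == '__main__':
    # Use spawn so that each worker initializes CUDA in a fresh process.
    # force=True overrides a start method that has already been set.
    mp.set_start_method('spawn', force=True)
    main(args)  # main and args are defined elsewhere in main.py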
Then run the program again; it should now run successfully.
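If changing the global start method is undesirable, for example because other code in the same process relies on fork, the standard library also offers `multiprocessing.get_context`, which scopes the choice to a single pool. This is a minimal sketch, assuming the work is dispatched with `pool.starmap` as in the traceback in section 1. The body of `train_test` and the helper `run_all` are hypothetical placeholders, not the original training code.

```python
import multiprocessing as mp


def train_test(run_id, scale):
    # Hypothetical stand-in for the real training function. In the real
    # program, GPU work such as backbone.to(device) would happen here, so
    # that CUDA is first initialized inside the spawned worker.
    return run_id * scale


def run_all(params, processes=2):
    # A spawn context affects only pools created from it; the global
    # start method of the process is left untouched.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return pool.starmap(train_test, params)


if __name__ == "__main__":
    # The guard is required with spawn: each worker re-imports this module,
    # and unguarded code would try to create pools recursively.
    results = run_all([(1, 10), (2, 10), (3, 10)])
    print(results)
```

With spawn, the worker function and its arguments must be picklable, so `train_test` has to be defined at module level rather than as a lambda or a nested function.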