YOLOv5 single-machine multi-GPU training error
Traceback (most recent call last):
  File "train.py", line 638, in <module>
    main(opt)
  File "train.py", line 532, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 113, in train
    data_dict = data_dict or check_dataset(data)  # check if None
  File "/home/stxx/renxs/anaconda3/envs/mmyolo/lib/python3.8/contextlib.py", line 120, in __exit__
Traceback (most recent call last):
  File "train.py", line 638, in <module>
    main(opt)
  File "train.py", line 532, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 112, in train
    with torch_distributed_zero_first(LOCAL_RANK):
  File "/home/stxx/renxs/anaconda3/envs/mmyolo/lib/python3.8/contextlib.py", line 113, in __enter__
    next(self.gen)
    return next(self.gen)
  File "/home/stxx/syy/yolov5-3class/yolov5/utils/torch_utils.py", line 94, in torch_distributed_zero_first
  File "/home/stxx/syy/yolov5-3class/yolov5/utils/torch_utils.py", line 91, in torch_distributed_zero_first
    dist.barrier(device_ids=[local_rank])
TypeError: barrier() got an unexpected keyword argument 'device_ids'
    dist.barrier(device_ids=[0])
TypeError: barrier() got an unexpected keyword argument 'device_ids'
Traceback (most recent call last):
  File "/home/stxx/renxs/anaconda3/envs/mmyolo/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/stxx/renxs/anaconda3/envs/mmyolo/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/stxx/syy/mmyolo_venv/mmyolo_venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/stxx/syy/mmyolo_venv/mmyolo_venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
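Solution
Change the PyTorch version. The traceback shows that dist.barrier() in the installed torch does not accept the device_ids argument that YOLOv5 passes to it. My original versions were
torch 1.7.1+cu110, torchvision 0.8.2
and I switched to
torch 1.8.0+cu111, torchvision 0.9.0+cu111
My CUDA version is 11, so I installed them as follows: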
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
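Before relaunching, it can help to confirm that the environment really picked up the new build. The sketch below is only an illustration: the helper names `version_tuple` and `supports_barrier_device_ids` are made up for this post, and the (1, 8, 0) threshold is an assumption taken from the observation above (1.7.1 fails, 1.8.0 works), not from any official compatibility table.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.8.0+cu111' into (1, 8, 0).

    The local build suffix ('+cu111') is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(group or 0) for group in match.groups())


def supports_barrier_device_ids(version, minimum=(1, 8, 0)):
    """Assumption: builds at or above `minimum` accept barrier(device_ids=...)."""
    return version_tuple(version) >= minimum
```

In the training environment, the check would be called as `supports_barrier_device_ids(torch.__version__)` after `import torch`.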
After that, train.py runs on multiple GPUs:
python -m torch.distributed.launch --nproc_per_node=2 --master_port 8089 train.py --device 2,3
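If changing the torch version is not possible, another option might be to make the barrier call tolerant of both signatures. This is a rough sketch, not what YOLOv5 itself does: `accepts_kwarg` and `barrier_compat` are hypothetical helpers, and the barrier function is passed in as a parameter so the logic can be tried without a distributed setup. I have not verified it on a real multi-GPU run with torch 1.7.1.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` has a parameter called `name` or takes **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def barrier_compat(barrier_fn, local_rank):
    """Call `barrier_fn` with device_ids when it is supported, else without it."""
    if accepts_kwarg(barrier_fn, "device_ids"):
        return barrier_fn(device_ids=[local_rank])
    return barrier_fn()
```

Under that assumption, the two failing calls in utils/torch_utils.py would become `barrier_compat(dist.barrier, local_rank)` and `barrier_compat(dist.barrier, 0)`. Without device_ids the barrier cannot be told which GPU to use, so upgrading as described above remains the safer fix.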