PyTorch Distributed Training Errors

Latest recommended article published on 2024-05-03 17:10:10

Wanderer001

Category: Exception Handling. Tags: computer vision, deep learning, machine learning

Article link: https://blog.csdn.net/weixin_36670529/article/details/106729116

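Reference: Introduction to PyTorch Distributed Training - Cloud+ Community - Tencent Cloud

Reference: PyTorch Distributed Training Errors - Cloud+ Community - Tencent Cloud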

subprocess.CalledProcessError: Command '['/home/labpos/anaconda3/envs/idr/bin/python', '-u', 'main_distribute.py', '--local_rank=1']' returned non-zero exit status 1.
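This message appears to come from the launcher rather than from the training script: the `-u` and `--local_rank=1` arguments match how `torch.distributed.launch` starts each worker, and a `subprocess.CalledProcessError` only reports that a worker exited with a non-zero status. The actual cause is the traceback the worker printed earlier in the log. A minimal standard-library sketch of that behaviour follows; the failing worker is a hypothetical stand-in, not the author's `main_distribute.py`.

```python
import subprocess
import sys

# Hypothetical stand-in for a training worker that dies with an exception.
worker_code = "raise RuntimeError('simulated worker failure')"
cmd = [sys.executable, "-u", "-c", worker_code]

# Run the worker and keep its stderr so the real traceback is not lost.
process = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)

launcher_error = None
try:
    # Raises CalledProcessError when the exit status is non-zero.
    process.check_returncode()
except subprocess.CalledProcessError as exc:
    launcher_error = exc

# The launcher-level error only carries the command and the exit status.
print("launcher error:", launcher_error)
# The worker's own stderr carries the exception that needs fixing.
print("worker stderr:")
print(process.stderr)
```

When reading a real log, scroll up from the `CalledProcessError` line to the first traceback raised by one of the ranks.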

Problem encountered when training with PyTorch DistributedDataParallel

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-

Fix: pass find_unused_parameters=True when constructing DistributedDataParallel:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
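To see which parameters trigger the error, one option is to run a single forward and backward pass on the unwrapped model and list the parameters whose `.grad` is still `None`. The sketch below does that and then wraps the model with `find_unused_parameters=True`. Everything in it is an assumption for illustration: `TwoBranchNet` is a hypothetical model, and a single-process `gloo` group on CPU replaces the multi-GPU launch so the snippet can run on its own. In a real job, keep `device_ids` and `output_device` as in the line above.

```python
import socket

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoBranchNet(nn.Module):
    """Hypothetical model: `unused_branch` never contributes to the output."""

    def __init__(self):
        super().__init__()
        self.used_branch = nn.Linear(4, 2)
        self.unused_branch = nn.Linear(4, 2)

    def forward(self, x):
        return self.used_branch(x)


# Step 1: on the plain module, parameters that never reach the loss
# have no gradient after backward().
net = TwoBranchNet()
net(torch.randn(8, 4)).sum().backward()
unused = [name for name, p in net.named_parameters() if p.grad is None]
print("unused parameters:", unused)
net.zero_grad()

# Step 2: single-process process group on a free local port (demo only).
with socket.socket() as s:
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
dist.init_process_group(
    backend="gloo", init_method=f"tcp://127.0.0.1:{port}", rank=0, world_size=1
)

# Step 3: wrap the model and let DDP look for unused parameters each iteration.
model = DDP(net, find_unused_parameters=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

completed_steps = 0
for _ in range(2):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()
    completed_steps += 1

dist.destroy_process_group()
```

The flag makes DDP traverse the autograd graph on every forward pass, which adds overhead. If the parameters listed in step 1 are not needed at all, removing them from the model, or making every forward output participate in the loss as the error message suggests, avoids that cost.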
