RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

叙利亚男篮

Edited on 2024-11-14 15:28:40

Tags: python, deep learning

First published on 2024-11-14 02:40:09

Article link: https://blog.csdn.net/lxy_JavaSpace/article/details/143756182

Status: originally unsolved, now solved:

Method 1:

As the error message itself suggests, in

torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], broadcast_buffers=False, find_unused_parameters=False)

change find_unused_parameters to True.

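In my case this change raised a new error, but it is still worth trying: most of the people I saw on GitHub reported that it worked for them.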
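A minimal sketch of Method 1 follows. It is not the original training script: TinyNet and run_two_steps are hypothetical names, and it assumes a single CPU process with the gloo backend and an in-memory HashStore so that it can run without GPUs or a launcher. In a real multi-GPU job you would keep device_ids=[args.gpu] as in the call above. The model deliberately has a branch that never runs in forward, which is the situation that typically triggers this error.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TinyNet(nn.Module):
    """Hypothetical model with one branch that forward() never calls."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # registered, but never used in forward

    def forward(self, x):
        return self.used(x)


def run_two_steps(find_unused):
    # Assumption: single-process group on CPU, only for demonstration.
    if not dist.is_initialized():
        dist.init_process_group(
            "gloo", store=dist.HashStore(), rank=0, world_size=1
        )
    model = DDP(
        TinyNet(),
        broadcast_buffers=False,
        find_unused_parameters=find_unused,
    )
    # Two iterations, because the error is reported when the next
    # iteration starts before the previous reduction has finished.
    for _ in range(2):
        model.zero_grad()
        model(torch.randn(3, 4)).sum().backward()
    return model


if __name__ == "__main__":
    run_two_steps(find_unused=True)
```

Note that find_unused_parameters=True makes DDP traverse the autograd graph on every iteration, which adds some overhead, so removing the unused modules (Method 2) is usually the cleaner fix when it is possible.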

Method 2:

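Run the job with TORCH_DISTRIBUTED_DEBUG=DETAIL set to see exactly which module triggers the error. In my case, some ModuleList members in my own decoder were defined (that is, assigned as self.xxx = xxx in __init__) but never took part in the gradient backpropagation of the loss. Deleting those extra modules from __init__ fixed the problem.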
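As a complement to the debug flag, the same kind of unused module can be spotted without DDP by running one backward pass on the plain model and listing the parameters whose gradient is still None. This is a rough sketch under stated assumptions: Decoder and params_without_grad are hypothetical names, and the toy decoder only imitates the situation described above (a ModuleList created in __init__ but never called in forward).

```python
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """Hypothetical decoder with an extra ModuleList that is never used."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(2)])
        self.extra = nn.ModuleList([nn.Linear(8, 8)])  # defined, never called

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x


def params_without_grad(model, loss):
    """Return names of trainable parameters that received no gradient."""
    model.zero_grad(set_to_none=True)  # make sure stale grads do not hide anything
    loss.backward()
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


model = Decoder()
loss = model(torch.randn(4, 8)).sum()
unused = params_without_grad(model, loss)
print(unused)  # ['extra.0.weight', 'extra.0.bias']
```

Any name printed here belongs to a module that does not contribute to the loss, so it is a candidate for removal from __init__.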

For more details, see this post: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. (CSDN blog)