Error message
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 109 110
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Cause of the error
Some parameters are initialized but never used in the model's forward pass, so they receive no gradients and cannot be updated during backpropagation. DDP's reducer expects every registered parameter to receive a gradient each iteration, so the unused ones make it hang and raise this error.
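A minimal sketch of the cause (the module and layer names here are made up for illustration): a layer that is created in `__init__` but never called in `forward` ends up with `grad is None` after `backward()`, which is exactly the condition DDP complains about.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # initialized, but never called in forward

    def forward(self, x):
        return self.used(x)

model = Net()
loss = model(torch.randn(2, 4)).sum()
loss.backward()
print(model.used.weight.grad is None)    # False: participated in the loss
print(model.unused.weight.grad is None)  # True: no gradient -> DDP reducer errors
```

Under plain single-process training this is harmless; only under DistributedDataParallel does the missing gradient turn into the RuntimeError above.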
Solutions
Method 1: simply ignore the problem
Add the following to your configuration file:
find_unused_parameters = True
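The config flag above is forwarded to the DDP constructor. If you build DDP yourself rather than through a framework config, the equivalent is passing `find_unused_parameters=True` directly. A runnable sketch (it spins up a single-process gloo group just so the example is self-contained; `Net` and its layer names are the same toy module as above, not from the original post):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process process group so the example runs standalone
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never used in forward

    def forward(self, x):
        return self.used(x)

# find_unused_parameters=True makes the reducer traverse the autograd graph
# each iteration and mark parameters outside it as ready, instead of erroring
model = DDP(Net(), find_unused_parameters=True)
for _ in range(2):  # a second iteration would raise without the flag
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()
dist.destroy_process_group()
```

Note this flag has a cost: the extra graph traversal every iteration slows training slightly, which is why finding and removing the unused parameters (Method 2) is preferable when possible.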
Method 2: find the parameters that never go through forward, and fix them directly
- If certain network components are genuinely unneeded, we can simply remove them. But first we need to debug which parameters and structures are unused, and then remove exactly those:
Finding the unused parameters
- Simply prefix your usual distributed training command with TORCH_DISTRIBUTED_DEBUG=DETAIL:
TORCH_DISTRIBUTED_DEBUG=DETAIL bash tools/dist_train.sh config/xxx.py 1
After running, the error output lists the exact parameters that received no gradient.
Then simply comment out those network components.
- Of course, there is one case where this doesn't work: for example, when only one level of the backbone's feature maps is used, but the backbone outputs multiple levels by default. This also triggers the error, yet we cannot simply delete part of the network, so Method 1 is the only option.
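The backbone case above can be sketched as follows (a toy two-stage backbone invented for illustration, not any specific library's model): when the loss only consumes the first feature level, the parameters of the later stage never enter the autograd graph, even though they were used to *compute* an output that forward returned.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Toy backbone returning feature maps from two stages."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 8, 3, padding=1)
        self.stage2 = nn.Conv2d(8, 16, 3, padding=1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        return f1, f2  # multi-level output, as in typical detection backbones

model = Backbone()
f1, f2 = model(torch.randn(1, 3, 8, 8))
loss = f1.sum()  # only the first level feeds the loss
loss.backward()
print(model.stage1.weight.grad is None)  # False
print(model.stage2.weight.grad is None)  # True -> would trigger the DDP error
```

Since `stage2` is an integral part of the backbone definition, removing it would mean rewriting the backbone; setting `find_unused_parameters = True` is the pragmatic fix here.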