The code runs fine on a single GPU, but with multiple GPUs the following error appears:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
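The quickest workaround is option (1) from the error message: pass find_unused_parameters=True when wrapping the model. A minimal sketch of what that looks like, using a single-process CPU gloo group purely so the snippet is self-contained (in real training the process group is set up by your launcher, e.g. torchrun), and a deliberately unused submodule to mimic the failure:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group just to make DDP constructible here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # defined but never called in forward

    def forward(self, x):
        return self.used(x)

# With find_unused_parameters=True, DDP traverses the autograd graph each
# iteration and marks self.unused as ready, so reduction can finish.
model = DDP(TwoBranch(), find_unused_parameters=True)
for _ in range(2):  # a second iteration is what normally triggers the error
    out = model(torch.randn(2, 4))
    out.sum().backward()
dist.destroy_process_group()
```

This flag adds per-iteration graph traversal overhead, so treating it as a band-aid and actually fixing the unused parameters, as described below, is usually better.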
I think the more appropriate fix is to find which part of the code has parameters that receive no backpropagated gradient, and correct that. You can inspect every parameter's gradient after the backward pass and before the parameter update. As a simple example, a typical training loop has three steps: compute the loss, backpropagate, update the parameters:
l = loss(predict, label)
l.backward()
optimizer.step()
After backward() and before optimizer.step(), every parameter of the network should have a gradient. So insert the following between backward() and step():
for name, param in model.named_parameters():
    if param.grad is None:
        print(name)
to see which parameters received no gradient, then analyze where the problem lies.
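The steps above can be put together in a self-contained sketch. The toy model and the layer name dead are made up for illustration; dead is registered but never used in forward, so its parameters keep grad of None after backward():

```python
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(4, 1)
        self.dead = torch.nn.Linear(4, 4)  # registered but never used in forward

    def forward(self, x):
        return self.head(x)

model = Toy()
loss = model(torch.randn(8, 4)).mean()
loss.backward()

# Between backward() and optimizer.step(): unused parameters still have grad None.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)  # → ['dead.weight', 'dead.bias']
```

Under DDP, exactly these parameters are the ones that make the reducer wait forever and raise the RuntimeError above.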
In practice, complex codebases may not expose these three steps explicitly, since everything is wrapped up, but the way to debug this problem is always the same: find which parameters receive no gradient. For example, suppose you use mmdetection with Faster R-CNN for object detection and hit a similar error. Locate where the losses are collected; here that is in the forward_train method of TwoStageDetector in mmdet/models/detectors/two_stage.py:
roi_losses = self.roi_head.forward_train(x, img_metas, proposal_list,
                                         gt_bboxes, gt_labels,
                                         gt_bboxes_ignore, gt_masks,
                                         **kwargs)
losses.update(roi_losses)
Then add the following right after losses.update(roi_losses):
for name, p in self.roi_head.named_parameters():
    # print(name)
    if p.grad is None:
        print(name)
to pinpoint which part is causing the problem.
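Since the same check works in any codebase, it can be packaged as a small reusable helper. The function name report_unused_params is my own invention, not part of mmdetection or PyTorch; note that if you call it inside a framework's forward path, the gradients you see belong to the previous iteration's backward pass:

```python
import torch

def report_unused_params(module):
    """Print and return the names of parameters whose .grad is still None.

    Call after a backward pass has run; parameters listed here are the
    ones DDP would complain about as 'not used in producing loss'."""
    unused = [name for name, p in module.named_parameters() if p.grad is None]
    for name in unused:
        print(name)
    return unused

# Quick self-check on a toy module with a deliberately unused layer.
net = torch.nn.ModuleDict({
    "used": torch.nn.Linear(3, 1),
    "dead": torch.nn.Linear(3, 3),
})
net["used"](torch.randn(2, 3)).sum().backward()
names = report_unused_params(net)  # prints dead.weight and dead.bias
```

Once the offending parameters are identified, either route them into the loss, remove them from the model, or fall back to find_unused_parameters=True if they are intentionally conditional.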