Problem description: an in-place operation error during distributed training
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 64, 64]], which is output 0 of MaskedFillBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
text_to_img_exp, img_to_text_exp = map(lambda t: t.masked_fill_(pos_mask, 0.), (text_to_img_exp, img_to_text_exp))
text_to_img_exp, img_to_text_exp = map(lambda t: t.masked_fill(pos_mask, 0.), (text_to_img_exp, img_to_text_exp))
The first line is my buggy code and the second is the corrected version. The only difference is a single trailing underscore: in PyTorch, methods whose names end with an underscore (like `masked_fill_`) are by convention in-place operations, while the underscore-free variants (like `masked_fill`) return a new tensor and leave the original untouched.
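A minimal sketch of this class of error, outside any distributed setup. This is not the original training code; `exp` here stands in for whatever preceding op saved its output for the backward pass — the mechanism (in-place write bumps the tensor's version counter, `backward()` then detects the mismatch) is the same:

```python
import torch

x = torch.ones(3, requires_grad=True)
mask = torch.tensor([True, False, True])

# exp() saves its output tensor for the backward pass (d/dx exp(x) = exp(x)).
y = x.exp()

# In-place masked_fill_ overwrites that saved tensor and bumps its
# version counter, so backward() later raises:
# "one of the variables needed for gradient computation has been
#  modified by an inplace operation ... MaskedFillBackward0 ..."
y.masked_fill_(mask, 0.)

# The out-of-place version allocates a new tensor instead, so the
# tensor saved by exp() stays at version 0 and backward() succeeds:
x2 = torch.ones(3, requires_grad=True)
y2 = x2.exp().masked_fill(mask, 0.)
y2.sum().backward()
print(x2.grad)  # gradient flows normally
```

Note that the error only surfaces at `backward()` time, which is why the traceback points at autograd internals rather than at the offending `masked_fill_` line (running with `torch.autograd.set_detect_anomaly(True)` helps locate it).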
Beyond that, a few other blog posts suggest the following fixes:
- In Python, in-place operations can also come from augmented assignments such as `+=` or `*=`. For example, `x += y` should be rewritten as `x = x + y`. https://blog.csdn.net/m0_66237895/article/details/134646105
- Others append `.clone()` to the offending variables, so that later in-place writes hit a copy instead of a tensor that autograd saved for the backward pass.
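Both workarounds can be sketched in a few lines. This is illustrative code (the tensors and ops are made up, with `sigmoid` standing in for any op that saves its output for backward), not the code from the posts above:

```python
import torch

# Fix 1: replace the augmented assignment with an out-of-place one.
a = torch.ones(3, requires_grad=True)
b = a.sigmoid()     # sigmoid saves its output for the backward pass
# b += 1.0          # in-place: would corrupt the saved tensor
b = b + 1.0         # out-of-place: binds b to a new tensor, safe
b.sum().backward()  # gradients flow normally

# Fix 2: clone first, then the in-place write only touches the copy.
c = torch.ones(3, requires_grad=True)
d = c.sigmoid().clone()  # d no longer aliases the tensor saved by sigmoid
d += 1.0                 # in-place on the clone is harmless
d.sum().backward()
```

`.clone()` is recorded in the autograd graph, so gradients still flow through it; the trade-off is one extra tensor allocation, whereas rewriting `x += y` as `x = x + y` costs nothing extra beyond what the out-of-place op already allocates.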