I've recently been writing code with PyTorch's DDP (DistributedDataParallel) tool.
I ran into a problem: if the model runs two forward passes within a single training step, like this:
model = Model()   # assumed to be wrapped with DDP elsewhere; omitted here
for i, (x1, x2, y) in enumerate(trloader):
    x1 = x1.cuda()
    x2 = x2.cuda()
    y = y.cuda()
    p1 = model(x1)   # first forward pass
    p2 = model(x2)   # second forward pass
    ...
then the program throws the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
After going through the model and setting inplace=False everywhere, the program still threw the same error.
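For reference, a quick sketch (the helper name disable_inplace is my own, not from the original post) of what "set inplace=False everywhere" amounts to: flip the flag on every module that exposes it, such as nn.ReLU(inplace=True) or nn.Dropout(inplace=True).

import torch.nn as nn

def disable_inplace(model: nn.Module) -> None:
    # Walk every submodule and turn off its in-place flag if it has one.
    for m in model.modules():
        if hasattr(m, "inplace"):
            m.inplace = False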
It finally turned out to be the BatchNormalization layers: after removing all the BN layers, the problem disappeared.
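To see why BN is a natural suspect, here is a small standalone sketch (my own illustration, not from the original debugging): in training mode every forward pass rewrites running_mean and running_var in place, so calling the model twice mutates those per-channel buffers twice before backward runs, which is exactly the kind of in-place change the version-counter error complains about.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(64).train()
buf = bn.running_mean           # the [64]-sized buffer, same object on every call
print(buf.clone())              # initial running mean: all zeros
bn(torch.randn(8, 64))          # first forward updates the stats in place
bn(torch.randn(8, 64))          # second forward updates them again
print(buf)                      # same tensor, now holding new statistics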
The best fix is to convert the BN layers to SyncBatchNorm:
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
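For context, here is a sketch of where that conversion fits in a typical DDP setup. convert_sync_batchnorm has to be called before the model is wrapped in DistributedDataParallel; the process-group initialization and LOCAL_RANK handling below assume a torchrun-style launch and are not from the original post.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = Model().cuda(local_rank)
# Replace every BatchNorm layer with SyncBatchNorm *before* wrapping with DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])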