Recently, while hacking on a network that contains an InplaceABN module, I ran into the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 256, 7, 7]], which is output 0 of InPlaceABNBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
When I first used InplaceABN, I had never studied its paper or code, so fixing this took several hours of blind trial and error. I knew that consecutive inplace operations were the cause, but I could not pinpoint which block of which module was responsible, and kept sprinkling clone() in the wrong places. Only the next day, after reading through the project's GitHub issues, did I truly understand the root cause.
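Before digging into InplaceABN itself, note that this class of error is easy to reproduce in plain PyTorch. A minimal sketch (my own illustration, not the network from this post) trips the same version-counter check using ReLU, whose backward also relies on its saved output:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.relu(x)   # autograd saves y to compute ReLU's backward
y += 1              # in-place op bumps y's version counter
y.sum().backward()  # RuntimeError: ... modified by an inplace operation
```

Autograd records a tensor's version when it is saved for backward; any later in-place write increments the version, and the mismatch is detected at backward() time, which is exactly the "is at version 3; expected version 1" message above.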
1. The blocks provided by InplaceABN
- ABN is standard BN + activation (no memory savings).
- InPlaceABN is BN + activation done inplace (with memory savings).
- InPlaceABNSync is BN + activation done inplace, with batch statistics synchronized across multiple GPUs (with memory savings).
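For reference, here is a minimal usage sketch, assuming the mapillary/inplace_abn package. A Conv layer followed by InPlaceABN replaces the usual Conv -> BN -> LeakyReLU triple:

```python
import torch.nn as nn
from inplace_abn import InPlaceABN  # pip install inplace-abn

# InPlaceABN fuses BN + activation and overwrites the conv output buffer,
# so the pre-activation tensor never has to be kept around for backward.
block = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=3, padding=1, bias=False),
    InPlaceABN(256, activation="leaky_relu", activation_param=0.01),
)
```

The memory saving comes precisely from this overwriting: since the input is gone, the backward pass must be reconstructed from the block's output, which is why that output must not be touched in place afterwards.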

To summarize: this post has described the runtime error encountered while modifying a network containing the InplaceABN module, walked through how InplaceABN works, and focused on why consecutive inplace operations make gradient computation fail. Tracing the actual case together with the clues from GitHub issues shows how to locate and fix the problem, and should serve as a guide for debugging similar errors.
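To make the failure pattern concrete, here is a hypothetical residual block of the kind discussed in the project's GitHub issues; the class and layer names are my own illustration, not the actual network from this post:

```python
import torch.nn as nn
from inplace_abn import InPlaceABN

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.abn = InPlaceABN(channels)

    def forward(self, x):
        identity = x
        out = self.abn(self.conv(x))
        # BUG: self.abn's backward needs `out` unmodified, but `out += identity`
        # would bump its version counter and fail at backward() with the
        # "output 0 of InPlaceABNBackward ... expected version 1" error.
        # FIX: use an out-of-place add instead of `+=`:
        out = out + identity
        return out
```

In other words, the clone() I kept trying was pointless unless it broke the chain of in-place writes on the tensor that InPlaceABN saved for backward; replacing the in-place accumulation that follows the block is what actually resolves the error.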