While debugging PyTorch distributed (DDP) code, I wanted to use the checkpoint mechanism to save GPU memory. After enabling it, the program reported the following error:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
Most solutions found online address reason 1), i.e., some part of the model being executed more than once in the forward pass; after inspecting the code I found no such situation. I then found an explanation in the linked discussion:
DDP does not work with torch.utils.checkpoint yet. One work around is to run forward-backward on the local model, and then manually run all_reduce to synchronize gradients after the backward pass.
and:
The reason find_unused_parameters=True does not work is because, DDP will try to traverse the autograd graph from output at the end of the forward pass when find_unused_parameters is set to True. However, with checkpoint, the autograd graphs are reconstructed during the backward pass, and hence it is not available when DDP tries to traverse it, which will make DDP think those unreachable parameters are not used in the forward pass (although they are just hidden by checkpoint). When setting find_unused_parameters=False, DDP will skip the traverse, and expect that all parameters are used and autograd engine will compute grad for each parameter exactly once.
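The workaround quoted earlier (run forward/backward on the plain local model, then manually `all_reduce` the gradients after `backward`) can be sketched as follows. This is a minimal single-process sketch, assuming a CPU "gloo" process group and a toy `nn.Linear` model so it runs without `torchrun`; in real use you would launch one process per GPU with the actual rank and world size.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist

# Single-process "gloo" group so the sketch runs standalone; in real
# use, launch one process per GPU and pass the real rank/world_size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(8, 1)  # plain local model, NOT wrapped in DDP
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Manually average gradients across ranks after the backward pass,
# replacing the synchronization DDP would normally do via hooks.
world_size = dist.get_world_size()
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size

print(all(p.grad is not None for p in model.parameters()))
dist.destroy_process_group()
```

Because no DDP hooks are involved, checkpointed modules never interact with DDP's "mark variable ready" bookkeeping, at the cost of doing the gradient synchronization yourself.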
The problem was solved after wrapping the model with
find_unused_parameters=False
(not True: as the explanation above notes, find_unused_parameters=True triggers the post-forward graph traversal that fails under checkpoint, while False skips it and expects every parameter's gradient to be computed exactly once).
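A minimal sketch of the working combination, assuming a recent PyTorch where torch.utils.checkpoint.checkpoint accepts use_reentrant=False (the non-reentrant variant cooperates better with DDP than the legacy reentrant one); the single-process "gloo" group and the toy two-layer model are just illustrative:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

# Single-process "gloo" group so the sketch runs without torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(8, 8)
        self.block2 = nn.Linear(8, 1)

    def forward(self, x):
        # Checkpoint block1: its activations are recomputed during
        # backward instead of being stored, trading compute for memory.
        x = checkpoint(self.block1, x, use_reentrant=False)
        return self.block2(x)

# find_unused_parameters=False: DDP skips the autograd-graph traversal
# after forward, so the checkpointed (not-yet-materialized) subgraph
# does not get misclassified as unused.
model = DDP(Net(), find_unused_parameters=False)
loss = model(torch.randn(4, 8)).sum()
loss.backward()
print(all(p.grad is not None for p in model.parameters()))
dist.destroy_process_group()
```

With this setup, each parameter's gradient is produced exactly once per backward pass, which is what DDP expects when the traversal is skipped.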