[PyTorch] DDP raises RuntimeError: Expected to mark a variable ready only once.

While debugging distributed PyTorch code that uses DDP, I wanted to use the gradient checkpoint mechanism (torch.utils.checkpoint) to save GPU memory. After enabling it, the program reported the following error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
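For context, here is a minimal sketch (not the original code) of the kind of setup that can trigger this error: a toy model whose forward pass wraps part of the computation in torch.utils.checkpoint, then gets wrapped in DDP with find_unused_parameters=True. The model, layer sizes, and device handling are illustrative assumptions, and torch.distributed is assumed to be initialized already (e.g. via torchrun).

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(128, 128)
        self.block2 = nn.Linear(128, 128)

    def forward(self, x):
        # Reentrant checkpointing rebuilds this part of the autograd graph
        # only during the backward pass, which is what trips up DDP below.
        x = checkpoint(self.block1, x)
        return self.block2(x)

# Assumes dist.init_process_group has already been called (e.g. via torchrun).
model = ToyModel().cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()],
                find_unused_parameters=True)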

Most of the solutions found online address reason 1), i.e. some part of the model being invoked more than once in the forward pass, but after checking my code I did not find this to be the case. I then found an explanation in the linked discussion:

DDP does not work with torch.utils.checkpoint yet. One work around is to run forward-backward on the local model, and then manually run all_reduce to synchronize gradients after the backward pass.
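A minimal sketch of that workaround, assuming the process group is already initialized and the model is left unwrapped (no DDP); sync_gradients is a hypothetical helper name:

import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module):
    # Manually average gradients across ranks after loss.backward(),
    # mimicking what DDP's reducer would otherwise do.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Usage sketch (model, optimizer, loss_fn, batch are placeholders):
# loss = loss_fn(model(batch))
# loss.backward()
# sync_gradients(model)
# optimizer.step()
# optimizer.zero_grad()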

and:

The reason find_unused_parameters=True does not work is because, DDP will try to traverse the autograd graph from output at the end of the forward pass when find_unused_parameters is set to True. However, with checkpoint, the autograd graphs are reconstructed during the backward pass, and hence it is not available when DDP tries to traverse it, which will make DDP think those unreachable parameters are not used in the forward pass (although they are just hidden by checkpoint). When setting find_unused_parameters=False, DDP will skip the traverse, and expect that all parameters are used and autograd engine will compute grad for each parameter exactly once.

The problem was resolved after constructing the DDP-wrapped model with

find_unused_parameters=False
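For reference, a sketch of the corresponding DDP construction, assuming model is the local module already moved to the right GPU and local_rank is this process's device index (both placeholders):

from torch.nn.parallel import DistributedDataParallel as DDP

# With find_unused_parameters=False (the default), DDP skips the graph
# traversal after forward, so parameters hidden behind checkpoint are not
# mistakenly marked as unused, and each gets its gradient exactly once.
ddp_model = DDP(model, device_ids=[local_rank],
                find_unused_parameters=False)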
