RuntimeError: Expected to mark a variable ready only once

With PyTorch 1.9, training the GLIP model on multiple GPUs raises the following error, while single-GPU training runs fine:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
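The invariant behind this message can be illustrated with a minimal stand-in (plain Python, not actual PyTorch internals): DDP's reducer expects each parameter's gradient to be marked "ready" exactly once per backward pass, and a re-entrant checkpoint backward that touches the same parameter again violates that.

```python
# Toy model of DDP's "mark ready once" invariant. The parameter name
# "b_attn.weight" is illustrative, matching the operator found later in
# this post; the Reducer class here is an analogy, not PyTorch's API.

class Reducer:
    def __init__(self, params):
        # Track whether each parameter's gradient was already marked ready.
        self.ready = {p: False for p in params}

    def mark_ready(self, param):
        if self.ready[param]:
            # This mirrors the RuntimeError raised by DDP.
            raise RuntimeError("Expected to mark a variable ready only once.")
        self.ready[param] = True

reducer = Reducer(["b_attn.weight"])
reducer.mark_ready("b_attn.weight")      # normal backward pass: fine
try:
    # A re-entrant backward (e.g. from checkpoint.checkpoint) reuses the
    # same parameter, marking it ready a second time.
    reducer.mark_ready("b_attn.weight")
except RuntimeError as e:
    print(e)
```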

Troubleshooting steps:

  1. Set the environment variable TORCH_DISTRIBUTED_DEBUG to DETAIL:
    export TORCH_DISTRIBUTED_DEBUG=DETAIL
    and in the training code set find_unused_parameters=True in torch.nn.parallel.DistributedDataParallel.
  2. Re-run the training code; the log will now name the specific layer or operator that failed — in my case, the b_attn operator.
  3. Find that operator in the code to locate the root cause. Here it turned out to be PyTorch's checkpoint.checkpoint(): the function trades compute for memory — instead of storing many intermediate activations it recomputes them during the backward pass, which reduces memory usage at the cost of extra computation and time. This function is what broke distributed multi-GPU training. From the PyTorch docs:
    Checkpointing works by trading compute for memory. Rather than storing all
    intermediate activations of the entire computation graph for computing
    backward, the checkpointed part does not save intermediate activations,
    and instead recomputes them in backward pass. It can be applied on any part
    of a model.

  4. Set the corresponding option to false in the config file so that checkpoint.checkpoint() is not used.
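The debug setup from steps 1 and 2 can be sketched as follows; the snippet only sets the environment variable (which must happen before the process group is initialized), and the DDP constructor call is shown as a comment since it requires a full distributed launch:

```python
import os

# Ask PyTorch distributed to report exactly which parameter was marked
# ready more than once (supported in PyTorch >= 1.9).
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# In the training script, the model would then be wrapped as, e.g.:
#   model = torch.nn.parallel.DistributedDataParallel(
#       model,
#       device_ids=[local_rank],          # local_rank comes from the launcher
#       find_unused_parameters=True,
#   )
# so the re-run prints the offending layer/operator name in the error log.
```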
Summary: this error generally means gradient synchronization across GPUs went wrong during distributed training; locate the specific operator involved in order to fix it.
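For the last step, the config change might look like the fragment below. The key name is hypothetical — GLIP's Swin backbone configs expose a checkpointing switch, but the exact option path may differ in your config file:

```
MODEL:
  BACKBONE:
    USE_CHECKPOINT: False   # hypothetical key: disable checkpoint.checkpoint() in the backbone
```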
