RuntimeError: Expected to mark a variable ready only once

With PyTorch 1.9, training the GLIP model on multiple GPUs raises the following error, while single-GPU training runs fine:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
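The invariant behind this message can be illustrated with a minimal stand-in (plain Python, not actual PyTorch internals): DDP's reducer expects each parameter's gradient to be marked "ready" exactly once per backward pass, and a re-entrant checkpoint backward that touches the same parameter again violates that.

```python
# Toy model of DDP's "mark ready once" invariant. The parameter name
# "b_attn.weight" is illustrative, matching the operator found later in
# this post; the Reducer class here is an analogy, not PyTorch's API.

class Reducer:
    def __init__(self, params):
        # Track whether each parameter's gradient was already marked ready.
        self.ready = {p: False for p in params}

    def mark_ready(self, param):
        if self.ready[param]:
            # This mirrors the RuntimeError raised by DDP.
            raise RuntimeError("Expected to mark a variable ready only once.")
        self.ready[param] = True

reducer = Reducer(["b_attn.weight"])
reducer.mark_ready("b_attn.weight")      # normal backward pass: fine
try:
    # A re-entrant backward (e.g. from checkpoint.checkpoint) reuses the
    # same parameter, marking it ready a second time.
    reducer.mark_ready("b_attn.weight")
except RuntimeError as e:
    print(e)
```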

Troubleshooting steps:

  1. Set the environment variable TORCH_DISTRIBUTED_DEBUG to DETAIL:
    export TORCH_DISTRIBUTED_DEBUG=DETAIL
    and in the training code set find_unused_parameters=True in torch.nn.parallel.DistributedDataParallel.
  2. Re-run the training code; the log will now name the specific layer or operator that failed — in my case, the b_attn operator.
  3. Find that operator in the code to locate the root cause. Here it turned out to be PyTorch's checkpoint.checkpoint(): the function trades compute for memory — instead of storing many intermediate activations it recomputes them during the backward pass, which reduces memory usage at the cost of extra computation and time. This function is what broke distributed multi-GPU training. From the PyTorch docs:
    Checkpointing works by trading compute for memory. Rather than storing all
    intermediate activations of the entire computation graph for computing
    backward, the checkpointed part does not save intermediate activations,
    and instead recomputes them in backward pass. It can be applied on any part
    of a model.

  4. Set the corresponding option to false in the config file so that checkpoint.checkpoint() is not used.
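The debug setup from steps 1 and 2 can be sketched as follows; the snippet only sets the environment variable (which must happen before the process group is initialized), and the DDP constructor call is shown as a comment since it requires a full distributed launch:

```python
import os

# Ask PyTorch distributed to report exactly which parameter was marked
# ready more than once (supported in PyTorch >= 1.9).
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# In the training script, the model would then be wrapped as, e.g.:
#   model = torch.nn.parallel.DistributedDataParallel(
#       model,
#       device_ids=[local_rank],          # local_rank comes from the launcher
#       find_unused_parameters=True,
#   )
# so the re-run prints the offending layer/operator name in the error log.
```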
Summary: this error generally means gradient synchronization across GPUs went wrong during distributed training; locate the specific operator involved in order to fix it.
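For the last step, the config change might look like the fragment below. The key name is hypothetical — GLIP's Swin backbone configs expose a checkpointing switch, but the exact option path may differ in your config file:

```
MODEL:
  BACKBONE:
    USE_CHECKPOINT: False   # hypothetical key: disable checkpoint.checkpoint() in the backbone
```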
