DeepSpeed full-parameter model training fails with "exits with return code = -7"

Latest recommended article published on 2024-04-11 14:37:39

愚昧之山绝望之谷开悟之坡

Categories: Various Errors, Notes, AIGC. Tags: notes

Article link: https://blog.csdn.net/qq_15821487/article/details/132437989

Error

exits with return code = -7

Solution

Official docker run documentation: https://docs.docker.com/engine/reference/commandline/run/

https://hub.yzuu.cf/microsoft/DeepSpeed/issues/4002

https://github.com/microsoft/DeepSpeed/issues/2897

Setting --shm-size to a large value instead of the default 64MB when creating the Docker container solved the problem in my case. It appears that multi-GPU training relies on shared memory.
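Before launching training, it can help to confirm how much shared memory the container actually has. The following is a minimal sketch, assuming a Linux container where shared memory is mounted at /dev/shm; the function names and the 2 GiB threshold are hypothetical and not taken from DeepSpeed or the linked issues.

```python
import os
import shutil

# Docker's default shared-memory size, as reported in the comments above.
DEFAULT_DOCKER_SHM_BYTES = 64 * 1024**2


def shm_total_bytes(path: str = "/dev/shm") -> int:
    """Return the total capacity of the shared-memory mount in bytes."""
    return shutil.disk_usage(path).total


def is_shm_sufficient(total_bytes: int, required_bytes: int) -> bool:
    """Return True if the mount is at least as large as the required size."""
    return total_bytes >= required_bytes


if __name__ == "__main__":
    required = 2 * 1024**3  # hypothetical threshold; tune it for your workload
    if os.path.isdir("/dev/shm"):
        total = shm_total_bytes()
        status = "ok" if is_shm_sufficient(total, required) else "too small"
        print(f"/dev/shm: {total / 1024**2:.0f} MiB ({status})")
    else:
        print("/dev/shm not found; this check assumes a Linux container")
```

If the script reports roughly 64 MiB, the container is still running with the default and was probably started without --shm-size.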

I ran with this AKS cluster YAML:
https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc
or with the docker command: docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename
and it worked.

Also, talking with @jomayeri a bit offline, it sounds like increasing Docker shared memory might help with this as well. One way to bump that up is by passing something like --shm-size="2gb" to your docker run command. The default is pretty small and can sometimes cause issues like this.
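If the container is started from a Python launcher script, the same flag can be passed through an argument list. This is only a sketch: the helper name build_docker_run_cmd is hypothetical, "imagename" is the placeholder image from the comment above, and the flags are just those quoted in this note.

```python
import shlex
from typing import List


def build_docker_run_cmd(image: str, shm_size: str = "2gb") -> List[str]:
    """Build a docker run argument list with an enlarged shared-memory size."""
    return [
        "docker", "run", "--rm",
        "--runtime=nvidia",       # NVIDIA runtime, as in the quoted command
        "--gpus", "all",          # expose all GPUs to the container
        "--shm-size", shm_size,   # override the 64MB default
        image,
    ]


cmd = build_docker_run_cmd("imagename", "3gb")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a host where Docker and the NVIDIA runtime are installed.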

Thank you for your advice. I checked the default Docker shm size and found it was only 64M. After I raised it to 64g, the script ran fine. I also tried "deepspeed all_reduce_bench_v2.py", and it exited successfully. I appreciate your answer.
