Error
exits with return code = -7
Solution
Official docker run documentation: https://docs.docker.com/engine/reference/commandline/run/
https://hub.yzuu.cf/microsoft/DeepSpeed/issues/4002
https://github.com/microsoft/DeepSpeed/issues/2897
Setting the shm-size to a large number instead of the default 64MB when creating the docker container solved the problem in my case. It appears that multi-GPU training relies on shared memory.
I ran with this AKS cluster YAML:
https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc
or with the docker command:
docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename
It worked.
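On Kubernetes there is no direct --shm-size flag; a common approach is to mount a memory-backed emptyDir volume at /dev/shm. The following is only a minimal sketch of that idea, not the exact YAML from the linked answer. The pod name, container name, volume name, file name, and the 3Gi limit are hypothetical placeholders, and imagename is reused from the docker command above.

```shell
# Write a Pod manifest that mounts a memory-backed emptyDir at /dev/shm.
# shm-demo, trainer, dshm, shm-pod.yaml and 3Gi are illustrative placeholders.
cat > shm-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shm-demo
spec:
  containers:
    - name: trainer
      image: imagename
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 3Gi
EOF

# Apply it once kubectl is configured for the cluster:
# kubectl apply -f shm-pod.yaml
```

Memory used under /dev/shm this way counts against the container's memory limit, so the limit may need to be raised together with sizeLimit.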
Also, after talking with @jomayeri a bit offline, it sounds like increasing docker shared memory might help with this as well. One way to bump that up is by passing something like --shm-size="2gb" to your docker run command. The default is pretty small and can sometimes cause issues like this.
Thank you for your advice. I checked the default docker shm size and found it was only 64M. After I raised it to 64g, the script ran fine. I also tried "deepspeed all_reduce_bench_v2.py", and it exited successfully. I appreciate your answer.
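To check the shared memory size from inside a running container before launching training, something like the following may help. It is a rough sketch: the function names shm_size_kb and check_shm are hypothetical, and the 65536 KiB threshold simply encodes the 64MB default mentioned above.

```shell
# Print the total size (in KiB) of the filesystem mounted at the given path.
# Defaults to /dev/shm, the shared-memory mount inside a container.
shm_size_kb() {
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}

# Warn when the size is at or below the 64MB default (65536 KiB).
check_shm() {
    size_kb=$(shm_size_kb "${1:-/dev/shm}")
    if [ "$size_kb" -le 65536 ]; then
        echo "shm is only ${size_kb} KiB; consider a larger --shm-size"
        return 1
    fi
    echo "shm is ${size_kb} KiB"
}

# Usage inside the container:
# check_shm
```

If check_shm reports the default size, recreate the container with a larger --shm-size value, since the setting is applied when the container is created.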