Deep Learning: Distributed Training with MXNet in Docker Containers in Practice

Article link: https://blog.csdn.net/github_37320188/article/details/99824905

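This article explains in detail how to use MXNet's launch.py for distributed training in Docker containers, discusses the different launch modes, and stresses the importance of Docker for multi-machine distributed training. The practical part walks through the environment variable settings and the changes to the launch script, and gives the requirements and suggested modifications for multi-machine, multi-GPU distributed computation.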


Introduction

MXNet supports distributed training enabling us to leverage multiple machines for faster training.

This statement, taken from the official MXNet website, says that MXNet can run training across multiple machines.

How to Start Distributed Training?
So how does MXNet actually implement distributed training?

A careful read of the official documentation turns up an example, accompanied by this passage:
For distributed training of this example, we would do the following:
If the mxnet directory which contains the script image_classification.py is accessible to all machines in the cluster (for example if they are on a network file system), we can run:

../../tools/launch.py -n 3 -H hosts --launcher ssh python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync

This is an example of distributed training launched over ssh.
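To show how the pieces of that command fit together, here is a minimal sketch that writes a host file and assembles the same argument list. It assumes that `hosts` is a plain-text file with one hostname or IP address per line, and that one worker runs per host, so `-n` equals the number of hosts. The IP addresses and the helper name `build_launch_command` are hypothetical. The script only builds the command; it does not start any training.

```python
import shlex
import tempfile
from pathlib import Path


def build_launch_command(hosts, hostfile, train_cmd, launcher="ssh"):
    """Write the host file and return the launch.py argument list.

    Hypothetical helper for illustration; it is not part of MXNet.
    """
    # Assumption: one host per line in the host file.
    Path(hostfile).write_text("\n".join(hosts) + "\n")
    return [
        "../../tools/launch.py",
        "-n", str(len(hosts)),      # assumption: one worker per host
        "-H", str(hostfile),
        "--launcher", launcher,
    ] + shlex.split(train_cmd)


# Hypothetical addresses, used only to make the example concrete.
example_hosts = ["192.168.0.11", "192.168.0.12", "192.168.0.13"]
train = (
    "python image_classification.py --dataset cifar10 "
    "--model vgg11 --epochs 1 --kvstore dist_sync"
)

with tempfile.TemporaryDirectory() as tmp:
    hostfile = Path(tmp) / "hosts"
    cmd = build_launch_command(example_hosts, hostfile, train)
    hostfile_lines = hostfile.read_text().splitlines()

print(" ".join(cmd))
```

Keeping the command as a list rather than a single string means it could be passed to `subprocess.run` without any shell quoting problems.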

The test code used here comes from https://github.com/apache/incubator-mxnet

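Note that launch.py is a very important tool here. Reading its Python source shows that running it is what actually sets up distributed training. It currently supports five launch modes for distributed or concurrent training: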

Launch modes

--launcher denotes the mode of communication. The options are:

ssh if machines can communicate through ssh without passwords. This is the default launcher mode.
mpi if Open MPI is available
sge for Sun Grid Engine
yarn for Apache Yarn
local for launching all processes on the same local machine. This can be used for debugging purposes.
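The way such a flag restricts the user to one of the documented modes can be sketched with `argparse`. This is an illustration only and not the actual launch.py source: the `LAUNCHERS` table and the `parse_launcher` helper are hypothetical, and only the five mode names and the `ssh` default are taken from the documentation quoted above.

```python
import argparse

# The five launcher modes listed in the documentation quoted above.
LAUNCHERS = {
    "ssh": "machines reachable over passwordless ssh (default)",
    "mpi": "Open MPI is available",
    "sge": "Sun Grid Engine",
    "yarn": "Apache Yarn",
    "local": "all processes on the local machine, for debugging",
}


def parse_launcher(argv):
    """Return the selected launcher mode (hypothetical helper, not MXNet code)."""
    parser = argparse.ArgumentParser(description="illustrative launcher selection")
    parser.add_argument(
        "--launcher",
        choices=sorted(LAUNCHERS),
        default="ssh",
        help="mode of communication between machines",
    )
    return parser.parse_args(argv).launcher


print(parse_launcher([]))                       # falls back to the default mode
print(parse_launcher(["--launcher", "local"]))  # explicit debugging mode
```

Any value outside the table makes `argparse` exit with a usage error, which is the behaviour one would want from a launcher that only knows a fixed set of backends.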

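The official site describes five ways to run distributed or parallel training, all of which work in the manner of a cluster manager. If you read through the whole test code, you may find other cluster managers as well, such as mesos; for reasons that are not clear, the official site neither describes it nor explicitly states that mesos scheduling is supported.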

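In practice, this differs somewhat from distributed training in tensorflow or pytorch.
With tensorflow or pytorch, you may have to start the different roles yourself, either by hand or through an mpi tool, such as ps and worker in tensorflow or rank in pytorch.
MXNet, by contrast, leaves starting the different roles entirely to launch.py, which hides that work from ordinary users to some extent.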

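This has its advantages: ordinary users only need to write the training script and do not need to care how the distributed cluster operates. Correspondingly, it becomes a drawback when a user wants to run multi-machine, multi-node distributed training without launch.py, because the MXNet website does not explain in detail how to start distributed training manually.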
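To make the division of roles concrete, here is a minimal sketch of the per-process environment that a launcher has to prepare on the user's behalf. The `DMLC_*` variable names and the three roles (scheduler, server, worker) are the ones read by MXNet's parameter-server backend. The helper names `role_env` and `plan_cluster` are hypothetical, and the sketch only builds the environment dictionaries; it does not start any process.

```python
def role_env(role, root_uri, root_port, num_server, num_worker):
    """Return the DMLC_* environment for one process of the given role."""
    if role not in ("scheduler", "server", "worker"):
        raise ValueError(f"unknown role: {role}")
    return {
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": root_uri,         # address of the scheduler
        "DMLC_PS_ROOT_PORT": str(root_port),  # port the scheduler listens on
        "DMLC_NUM_SERVER": str(num_server),
        "DMLC_NUM_WORKER": str(num_worker),
    }


def plan_cluster(root_uri, root_port, num_server, num_worker):
    """List the environments for one scheduler plus all servers and workers."""
    roles = ["scheduler"] + ["server"] * num_server + ["worker"] * num_worker
    return [role_env(r, root_uri, root_port, num_server, num_worker) for r in roles]


plan = plan_cluster("127.0.0.1", 9092, num_server=2, num_worker=2)
for env in plan:
    # A real launcher would merge env into os.environ and spawn the
    # training command once per entry, for example with subprocess.Popen.
    print(env["DMLC_ROLE"], env["DMLC_PS_ROOT_URI"], env["DMLC_PS_ROOT_PORT"])
```

Every process receives the same scheduler address and the same server and worker counts; only `DMLC_ROLE` differs, which is why the same training command can be reused for all of them.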

Official single-machine concurrent training

export COMMAND='python example/gluon/image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync'
DMLC_ROLE=server DMLC_PS_ROOT_URI=127.0.0.1 DMLC_PS_ROOT_PORT=9092 DMLC_NUM_SERVER=2 DMLC_NUM_WORKER=2 $COMMAND &
DMLC_ROLE=server DMLC_PS_ROOT_URI=127.0.0.1 DMLC_PS_ROOT_PORT=9092 DMLC_NUM_SERV