Horovod Distributed Training in Practice - CSDN Blog

Horovod Study Notes

Introduction
Installation
Usage
Notes

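Introduction

Horovod is a distributed training framework built and open-sourced by Uber. It was first developed for TensorFlow and also supports Keras, PyTorch, and Apache MXNet. Its design draws on Facebook's "Training ImageNet in 1 Hour" paper and Baidu's ring-allreduce work. Horovod makes distributed deep learning projects simpler and faster to get running: it implements ring allreduce on top of the Message Passing Interface (MPI), which noticeably improves the usability and performance of distributed TensorFlow training.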
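To make the ring-allreduce idea concrete, here is a minimal single-process simulation in plain Python. It is only a sketch of the general algorithm, not Horovod's implementation: the function name ring_allreduce is made up for illustration, every "worker" is just a list, and all vectors are assumed to have the same length. Each vector is split into one chunk per worker; a scatter-reduce phase passes partial sums around the ring, then an allgather phase circulates the finished chunks.

```python
def ring_allreduce(vectors):
    """Simulate a sum ring allreduce; vectors[i] is the data held by worker i."""
    n = len(vectors)
    size = len(vectors[0])
    # Chunk c covers indices bounds[c]:bounds[c + 1].
    bounds = [size * c // n for c in range(n + 1)]
    data = [list(v) for v in vectors]

    def chunk(worker, c):
        return data[worker][bounds[c]:bounds[c + 1]]

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # All sends in a step happen at once, so snapshot them first.
        outgoing = [chunk(i, (i - step) % n) for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            lo = bounds[(i - step) % n]
            for k, value in enumerate(outgoing[i]):
                data[dst][lo + k] += value

    # Phase 2 (allgather): pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [chunk(i, (i + 1 - step) % n) for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            lo = bounds[(i + 1 - step) % n]
            data[dst][lo:lo + len(outgoing[i])] = outgoing[i]

    return data
```

Each worker only ever talks to its ring neighbour, and every worker ends up with the same element-wise sum. Dividing that sum by the number of workers gives the averaged gradient.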

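Installation

Verifying the Installation

Listing the Installed Packages

To see the full list of packages installed in the environment, run the following commands.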

$ conda activate $ENV_PREFIX # optional if environment already active
$ conda list

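Final Check

After building the Conda environment, use the following command to check that Horovod was built with support for the deep learning frameworks TensorFlow, PyTorch, and Apache MXNet, and for the MPI and Gloo controllers.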

$ conda activate $ENV_PREFIX # optional if environment already active
$ horovodrun --check-build

If the output looks like this, the installation succeeded:

Horovod v0.19.4:
Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet
Available Controllers:
    [X] MPI
    [X] Gloo
Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo

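Additional Notes

If you see an error such as No module named '***', running pip install for the missing module usually fixes it.
Also, from what I have read, if the environment contains only one Python version, pip and pip3 should in theory behave the same. In my experience setting up environments, though, pip sometimes fails where pip3 succeeds, so if pip does not work, try pip3.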
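One likely reason for the difference is that pip and pip3 can be bound to different Python interpreters. Running pip as a module of the interpreter you actually use removes that ambiguity. The helper below is only a sketch; pip_install_command is a made-up name.

```python
import sys


def pip_install_command(package):
    """Build a pip command tied to the interpreter running this script."""
    # sys.executable is the path of the current Python interpreter, so the
    # package is installed into the same environment that will import it.
    return [sys.executable, "-m", "pip", "install", package]
```

Passing the returned list to subprocess.run(..., check=True) performs the install. On the command line, the same idea is python -m pip install followed by the package name.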

Usage

Modifying Your Code

To use Horovod, make the following additions to your program:

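1. Run hvd.init() to initialize Horovod.
2. Pin each GPU to a single process to avoid resource contention.
   With the typical setup of one GPU per process, set the visible GPU to the local rank. The first process on the server is allocated the first GPU, the second process the second GPU, and so on.
3. Scale the learning rate by the number of workers.
   The effective batch size in synchronous distributed training is scaled by the number of workers. The increase in learning rate compensates for the larger batch size.
4. Wrap the optimizer in hvd.DistributedOptimizer.
   The distributed optimizer delegates gradient computation to the original optimizer, averages the gradients using allreduce or allgather, and then applies the averaged gradients.
5. Broadcast the initial variable states from rank 0 to all other processes.
   This is necessary to ensure consistent initialization of all workers when training starts from random weights or is restored from a checkpoint.
6. Modify your code to save checkpoints only on worker 0, so that other workers cannot corrupt them.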
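The arithmetic behind the learning-rate scaling step can be spelled out in a few lines. This is only an illustrative sketch of the linear scaling rule described above; the function name and the sample numbers are assumptions, not taken from Horovod.

```python
def scale_for_workers(base_lr, per_worker_batch, num_workers):
    """Return (effective_batch_size, scaled_learning_rate)."""
    # Every worker processes its own batch in each synchronous step,
    # so the effective batch grows linearly with the worker count.
    effective_batch = per_worker_batch * num_workers
    # Linear scaling rule: grow the learning rate by the same factor.
    scaled_lr = base_lr * num_workers
    return effective_batch, scaled_lr
```

For example, with a base learning rate of 0.001, a per-worker batch of 32, and 4 workers, the effective batch size is 128 and the scaled learning rate is 0.004. In a Horovod script the worker count would come from hvd.size().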

TensorFlow example from the official Horovod documentation

TensorFlow v2 Example (from the MNIST example)

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Build model and dataset
dataset = ...
model = ...
loss = tf.losses.SparseCategoricalCrossentropy()
opt = tf.optimizers.Adam(0.001 * hvd.size())

checkpoint_dir = './checkpoints'
checkpoint = tf.train.Checkpoint(model=model, optimizer=opt)

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = model(images, training=True)
        loss_value = loss(labels, probs)

    # Horovod: add Horovod Distributed GradientTape.
    tape = hvd.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))

    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    #
    # Note: broadcast should be done after the first gradient step to ensure optimizer
    # initialization.
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

# Horovod: adjust number of steps based on number of GPUs.
for batch, (images, labels) in enumerate(dataset.take(10000 // hvd.size())):
    loss_value = training_step(images, labels, batch == 0)

    if batch % 10 == 0 and hvd.local_rank() == 0:
        print('Step #%d\tLoss: %.6f' % (batch, loss_value))

# Horovod: save checkpoints only on worker 0 to prevent other workers from
# corrupting it.
if hvd.rank() == 0:
    checkpoint.save(checkpoint_dir)

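Running

Typically each process is assigned one GPU, so a server with 4 GPUs runs 4 processes. With horovodrun, the number of processes is specified by the -np flag.
The command-line format resembles that of mpirun.
The number after -np is the total number of GPUs (one process per GPU across all hosts). In -H, the number after localhost or a server name is the number of GPUs on that host.
To run on a machine with 4 GPUs: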

$ horovodrun -np 4 -H localhost:4 python train.py

To run on 4 machines with 4 GPUs each:

$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
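The relationship between -np and -H in the two commands above can be captured in a small helper. This is a sketch only: horovodrun_command is a made-up name, and it assumes one process per GPU as described in this section.

```python
def horovodrun_command(hosts, script):
    """Build a horovodrun argument list from {host: gpu_count}."""
    # -np is the total number of processes: one per GPU over all hosts.
    total = sum(hosts.values())
    # -H lists each host with its own GPU (slot) count.
    host_arg = ",".join(f"{host}:{gpus}" for host, gpus in hosts.items())
    return ["horovodrun", "-np", str(total), "-H", host_arg, "python", script]
```

Calling it with four servers of 4 GPUs each reproduces the 16-process command shown above.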

You can also specify host nodes in a host file. For example:

$ cat myhostfile

aa slots=2
bb slots=2
cc slots=2

This example lists the host names (aa, bb, and cc) and how many “slots” there are for each. Slots indicate how many processes can potentially execute on a node. This format is the same as in mpirun command.

To run on hosts specified in a hostfile:

$ horovodrun -np 6 -hostfile myhostfile python train.py
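The value 6 passed to -np here is simply the sum of the slots in the host file. A minimal sketch for computing it, assuming the "name slots=N" layout shown above (total_slots is a made-up helper name, and skipping lines that start with # is an extra assumption):

```python
def total_slots(hostfile_text):
    """Sum the slots=N values over all hosts in a host file."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # The first field is the host name; the rest are key=value options.
        for field in line.split()[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                total += int(value)
    return total
```

For the example host file (aa, bb, and cc with 2 slots each) this returns 6.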

Failures due to SSH issues

The host where horovodrun is executed must be able to SSH to all other hosts without any prompts.

If horovodrun fails with a permission error, verify that you can ssh to every other server without entering a password or answering questions like this:

The authenticity of host '<hostname> (<ip address>)' can't be established. RSA key fingerprint is xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx. Are you sure you want to continue connecting (yes/no)?

To learn more about setting up passwordless authentication, see this page.

To avoid the "The authenticity of host '<hostname> (<ip address>)' can't be established" prompt, add all the hosts to the ~/.ssh/known_hosts file using ssh-keyscan:

$ ssh-keyscan -t rsa,dsa server1 server2 > ~/.ssh/known_hosts

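Notes

A Horovod training script is not launched like an ordinary Python script. For example, running python train.py directly does not start distributed training. Instead, launch the script with Horovod's dedicated CLI command, horovodrun: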

$ horovodrun -np 4 -H localhost:4 python train.py

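Check ssh localhost and ssh to the Other Servers

Every connection must go through directly, without entering a password or typing yes.
The way to achieve this is to generate an SSH key pair and install the public key on each host.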
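A quick way to test this from a script is to call ssh in batch mode, where it fails instead of asking for a password or a host-key confirmation. The sketch below uses the standard OpenSSH options BatchMode and ConnectTimeout; the helper names and the 5-second timeout are assumptions made for illustration.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh command that fails rather than prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never ask for passwords or confirmations
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        host,
        "true",                     # trivial remote command
    ]


def can_ssh_without_prompt(host):
    """Return True if `host` accepts a non-interactive SSH login."""
    result = subprocess.run(ssh_check_command(host), capture_output=True)
    return result.returncode == 0
```

Running can_ssh_without_prompt on localhost and on every server name passed to -H should return True before horovodrun is started.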

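Check the CUDA_HOME Variable

The conda environment set up by a senior labmate had this environment variable defined, but the server itself did not, which is why I kept getting "not found" errors. Use the following command to check whether the variable is set: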

$ echo $CUDA_HOME

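The export command lists all environment variables defined in the current shell.
The echo $PATH command prints the value of the current PATH environment variable.

If CUDA_HOME really is not set, add it to your .bashrc file.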
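The same check can be done from Python, which is handy at the top of a training script. This is a minimal sketch; cuda_home_status is a made-up helper name.

```python
import os


def cuda_home_status(env=None):
    """Describe whether CUDA_HOME is set and points to a real directory."""
    env = os.environ if env is None else env
    path = env.get("CUDA_HOME")
    if not path:
        # The fix is an export line in ~/.bashrc, for example
        # `export CUDA_HOME=/usr/local/cuda`. That path is only an
        # assumption; use the directory where CUDA is actually installed.
        return "CUDA_HOME is not set"
    if not os.path.isdir(path):
        return f"CUDA_HOME points to a missing directory: {path}"
    return f"CUDA_HOME = {path}"
```

After editing .bashrc, open a new shell or run source ~/.bashrc so the variable takes effect.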

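Common Errors

WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.130465). Check your callbacks.
Roughly, this means the model is small, so the work done in callbacks at the end of each batch (such as printing output) takes longer than the batch update itself, which triggers the warning.
DistributedOptimizer()
DistributedOptimizer() got an unexpected keyword argument 'average_aggregated_gradients'
DistributedOptimizer() got an unexpected keyword argument 'backward_passes_per_step'
This is a Horovod version issue: the API in older versions does not have these two parameters.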
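Besides upgrading Horovod, one defensive option is to pass only the keyword arguments that the installed version actually accepts. The sketch below uses the standard inspect module; supported_kwargs is a made-up helper, and whether silently dropping an argument is acceptable depends on your training setup.

```python
import inspect


def supported_kwargs(func, **candidates):
    """Keep only the keyword arguments that `func` accepts."""
    params = inspect.signature(func).parameters
    # A function that takes **kwargs accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    return {k: v for k, v in candidates.items() if k in params}


# Hypothetical usage in a Horovod script (not executed here):
#   extra = supported_kwargs(hvd.DistributedOptimizer, backward_passes_per_step=2)
#   opt = hvd.DistributedOptimizer(opt, **extra)
```

If the returned dict is missing an argument you rely on, that is a signal to upgrade Horovod rather than to train without it.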