Horovod Hands-On Practice (with Source Code and Detailed Configuration)

ERROR_LESS

Category: Horovod. Tags: tensorflow, deep learning, distributed

Article link: https://blog.csdn.net/qq_47058489/article/details/125997475

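0 Preface

The previous post covered the background on Horovod in detail, so this one moves on to hands-on practice.
The experiments run on Matpool (矩池云) machines, which provide a prebuilt Horovod image, so installation is not covered here.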

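1 Single Machine, Multiple GPUs

1.0 Hardware Configuration

To save money, I used a single machine with two GPUs.

1.1 Source Code

The official example source: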

import sys

import tensorflow as tf

import horovod
import horovod.tensorflow.keras as hvd


def main():
    # Horovod: initialize Horovod.
    hvd.init()

    # Horovod: pin GPU to be used to process local rank (one GPU per process)
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    (mnist_images, mnist_labels), _ = \
        tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
                 tf.cast(mnist_labels, tf.int64))
    )
    dataset = dataset.repeat().shuffle(10000).batch(128)

    mnist_model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
        tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    # Horovod: adjust learning rate based on number of GPUs.
    scaled_lr = 0.001 * hvd.size()
    opt = tf.optimizers.Adam(scaled_lr)

    # Horovod: add Horovod DistributedOptimizer.
    opt = hvd.DistributedOptimizer(
        opt, backward_passes_per_step=1, average_aggregated_gradients=True)

    # Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
    # uses hvd.DistributedOptimizer() to compute gradients.
    mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                        optimizer=opt,
                        metrics=['accuracy'],
                        experimental_run_tf_function=False)

    callbacks = [
        # Horovod: broadcast initial variable states from rank 0 to all other processes.
        # This is necessary to ensure consistent initialization of all workers when
        # training is started with random weights or restored from a checkpoint.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),

        # Horovod: average metrics among workers at the end of every epoch.
        #
        # Note: This callback must be in the list before the ReduceLROnPlateau,
        # TensorBoard or other metrics-based callbacks.
        hvd.callbacks.MetricAverageCallback(),

        # Horovod: using the scaled learning rate from the very beginning leads to worse final
        # accuracy. Ramp the learning rate from `scaled_lr / hvd.size()` to `scaled_lr` during
        # the first three epochs. See https://arxiv.org/abs/1706.02677 for details.
        hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1),
    ]

    # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
    if hvd.rank() == 0:
        callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

    # Horovod: write logs on worker 0.
    verbose = 1 if hvd.rank() == 0 else 0

    # Train the model.
    # Horovod: adjust number of steps based on number of GPUs.
    mnist_model.fit(dataset, steps_per_epoch=500 // hvd.size(), callbacks=callbacks, epochs=24, verbose=verbose)


if __name__ == '__main__':
    if len(sys.argv) == 4:
        # run training through horovod.run
        np = int(sys.argv[1])
        hosts = sys.argv[2]
        comm = sys.argv[3]
        print('Running training through horovod.run')
        horovod.run(main, np=np, hosts=hosts, use_gloo=comm == 'gloo', use_mpi=comm == 'mpi')
    else:
        # this is running via horovodrun
        main()
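
The learning-rate and step-count arithmetic in the script above can be sketched in plain Python. This is only an illustration under stated assumptions: it models the warmup as a linear per-epoch ramp from the base rate to the scaled rate, whereas the real `LearningRateWarmupCallback` adjusts the rate during training and its exact formula may differ. The helper names `scaled_lr`, `warmup_lr`, and `steps_per_epoch` are hypothetical and not part of Horovod.

```python
def scaled_lr(base_lr: float, size: int) -> float:
    """Scale the base learning rate by the number of workers (hvd.size())."""
    return base_lr * size


def warmup_lr(base_lr: float, size: int, epoch: int, warmup_epochs: int = 3) -> float:
    """Assumed linear ramp from base_lr (epoch 0) to base_lr * size (after warmup)."""
    target = scaled_lr(base_lr, size)
    if epoch >= warmup_epochs:
        return target
    fraction = epoch / warmup_epochs
    return base_lr + (target - base_lr) * fraction


def steps_per_epoch(total_steps: int, size: int) -> int:
    """Split a fixed number of steps across workers, as in `500 // hvd.size()`."""
    return total_steps // size


if __name__ == "__main__":
    # Show the assumed schedule for the two-GPU setup used in this post.
    for epoch in range(5):
        print(epoch, warmup_lr(0.001, 2, epoch))
    print("steps per worker:", steps_per_epoch(500, 2))
```

With two workers, each process runs 250 steps per epoch, so the total number of batches processed per epoch stays the same as in the single-GPU case.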

1.2 Running

horovodrun -np 2 -H localhost:2 python tensorflow2_keras_mnist.py

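2 Multiple Machines, Multiple GPUs

2.0 Hardware Configuration

Again to keep testing costs down, I used a two-machine, two-GPU configuration. The IP addresses are 192.168.1.37 and 192.168.1.38.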

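2.1 Environment Setup

Log in to either node and set up SSH connectivity between the nodes.

(myconda) root@b0945000424c:/# ssh-keygen -t rsa # press Enter at every prompt to generate the public/private key pair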
	Generating public/private rsa key pair.
	Enter file in which to save the key (/root/.ssh/id_rsa): 
	Enter passphrase (empty for no passphrase): 
	Enter same passphrase again: 
	Your identification has been saved in /root/.ssh/id_rsa.
	Your public key has been saved in /root/.ssh/id_rsa.pub.
	The key fingerprint is:
	SHA256:rCyXYDDQwFCxn+4SsIGB4vrCyBg9ZK20yXaNlxrZeqk root@b0945000424c
	The key's randomart image is:
	+---[RSA 2048]----+
	|B+o.             |
	|+o..             |
	|+.+.             |
	|+.++.. .         |
	|.X +== .S        |
	|+ Xo=o=o         |
	|=+ +o==.         |
	|+oo.ooo          |
	| . .Eo           |
	+----[SHA256]-----+

(myconda) root@b0945000424c:/# ssh-copy-id root@192.168.1.37 # copy the public key to the other node; enter that node's password when prompted
	/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
	The authenticity of host '192.168.1.37 (192.168.1.37)' can't be established.
	ECDSA key fingerprint is SHA256:mBoJB3tizC3nKPNphS7AKrsWtjiRt31P2VPuNys+9y4.
	Are you sure you want to continue connecting (yes/no)? yes
	/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
	/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
	root@192.168.1.37's password: 
	
	Number of key(s) added: 1
	
	Now try logging into the machine, with:   "ssh 'root@192.168.1.37'"
	and check to make sure that only the key(s) you wanted were added.
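
Before launching a multi-node job it can help to confirm that passwordless login really works. The following is a minimal sketch, assuming the OpenSSH client is installed and that the login user is `root` as in the transcript above. The function names `ssh_check_command` and `can_ssh` are hypothetical helpers, not part of Horovod.

```python
import subprocess


def ssh_check_command(host: str, user: str = "root", timeout_s: int = 5) -> list:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never fall back to a password prompt
        "-o", f"ConnectTimeout={timeout_s}",   # give up quickly on unreachable hosts
        f"{user}@{host}",
        "true",                                # trivial remote command
    ]


def can_ssh(host: str, user: str = "root") -> bool:
    """Return True if key-based login to the host succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host, user),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # The ssh client is not installed on this machine.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for node in ("192.168.1.37", "192.168.1.38"):
        print(node, "ok" if can_ssh(node) else "FAILED")
```

`BatchMode=yes` makes the check fail fast when the key was not installed correctly, rather than hanging on a password prompt.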

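Check the network interface names on both machines, then set the environment variables below.
Run these commands on both machines.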

(myconda) root@b0945000424c:/# export NCCL_SOCKET_IFNAME=meth811,meth812
(myconda) root@b0945000424c:/# export GLOO_IFACE=meth811,meth812
(myconda) root@b0945000424c:/# export NCCL_DEBUG=INFO # optional: prints extra NCCL debug information
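
The two steps above (look up the interface names, then export the variables) can be sketched in Python. This is a rough illustration: `list_interfaces`, `horovod_env`, and `export_lines` are hypothetical helper names, and the variable names are simply the ones used in this post. `socket.if_nameindex()` is a standard-library call that lists the interfaces on the local machine.

```python
import socket


def list_interfaces() -> list:
    """Return the names of the network interfaces on this machine."""
    return [name for _index, name in socket.if_nameindex()]


def horovod_env(ifaces: list, nccl_debug: bool = False) -> dict:
    """Build the environment variables used in this post from interface names."""
    joined = ",".join(ifaces)
    env = {"NCCL_SOCKET_IFNAME": joined, "GLOO_IFACE": joined}
    if nccl_debug:
        env["NCCL_DEBUG"] = "INFO"  # optional extra NCCL logging
    return env


def export_lines(env: dict) -> list:
    """Render the variables as shell `export` lines to paste on each node."""
    return [f"export {key}={value}" for key, value in env.items()]


if __name__ == "__main__":
    print("interfaces found:", list_interfaces())
    for line in export_lines(horovod_env(["meth811", "meth812"], nccl_debug=True)):
        print(line)
```

The interface names `meth811` and `meth812` come from the author's machines; on another cluster they should be replaced with whatever `list_interfaces()` reports.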

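2.2 Running

Normal run:
(myconda) root@b0945000424c:/mnt/Horovod# horovodrun -np 2 -H 192.168.1.37:1,192.168.1.38:1 --network-interface "192.168.1.37/24,192.168.1.38/24" python tensorflow2_keras_mnist.py
Parameter notes:
-np: the total number of processes (in practice, the total number of GPUs).
-H: the number of GPUs to use on each compute node, in the form IP:GPU count, with nodes separated by commas. The local machine must be listed as well; every node has to appear. For example, 192.168.1.37:1 means the node at 192.168.1.37 with 1 GPU.
--network-interface: the addresses of the compute nodes; these must correspond to the entries given to -H.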
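
The relationship between `-np` and `-H` described above can be checked with a small sketch: the total process count is the sum of the per-node slot counts. The helpers `parse_hosts` and `build_command` are hypothetical names for illustration, and the sketch requires an explicit `:slots` suffix on every entry rather than assuming any default.

```python
def parse_hosts(hosts: str) -> dict:
    """Parse 'ip:slots,ip:slots' into a {ip: slots} mapping."""
    result = {}
    for entry in hosts.split(","):
        host, sep, slots = entry.strip().partition(":")
        if not host or not sep or not slots:
            raise ValueError(f"expected 'host:slots', got {entry!r}")
        result[host] = int(slots)
    return result


def build_command(hosts: str, script: str) -> list:
    """Build a horovodrun command whose -np equals the total slot count."""
    total = sum(parse_hosts(hosts).values())
    return ["horovodrun", "-np", str(total), "-H", hosts, "python", script]


if __name__ == "__main__":
    cmd = build_command("192.168.1.37:1,192.168.1.38:1", "tensorflow2_keras_mnist.py")
    print(" ".join(cmd))
```

For the two-node setup in this post, the sketch yields `-np 2`, matching the command shown above.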
For performance profiling, run as follows:
(myconda) root@b0945000424c:/mnt/Horovod# horovodrun -np 1 --timeline-filename /path/to/timeline.json python tensorflow2_keras_mnist.py

The timeline file can then be opened at chrome://tracing/ in Google Chrome to trace performance.

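In my run the timeline file could not be opened, because no timeline file had actually been generated after training; checking the directory confirmed that the file did not exist.
My reading is that the example source does not handle a --timeline-filename argument, so that parameter would need to be added to the test source. Note also that /path/to/timeline.json in the command above is a placeholder and has to be replaced with a real, writable path.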
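
The check described above (does the timeline file exist after training?) can be sketched as a small helper to run before opening chrome://tracing/. `timeline_status` is a hypothetical name, and the sketch only inspects the file on disk; it does not validate the trace contents.

```python
import os


def timeline_status(path: str) -> str:
    """Report whether a timeline file is missing, empty, or ready to open."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


if __name__ == "__main__":
    # Replace this placeholder with the path passed to --timeline-filename.
    print(timeline_status("/path/to/timeline.json"))
```

A result of `missing` corresponds to the situation in this post, where training finished without producing the file.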
