Horovod Hands-On Practice (with Source Code and Detailed Configuration)

0 Preface

The previous post covered Horovod in detail, so this one moves on to hands-on practice with it.
The experiments run on Matpool (矩池云) machines, which provide a prepackaged Horovod image, so installation is not covered here.
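
For reference, in environments without a prepackaged image, the Horovod README installs the GPU/NCCL build of the TensorFlow variant with a command along these lines (the exact build flags depend on your CUDA/NCCL setup):

HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow,keras]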

1 Single Machine, Multiple GPUs

1.0 Hardware Configuration

To save money, a single machine with two GPUs is used.
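
As a quick sanity check (a minimal sketch, not part of the original setup), you can confirm that TensorFlow sees both GPUs before involving Horovod at all:

import tensorflow as tf

# Expect two PhysicalDevice entries on this machine
print(tf.config.experimental.list_physical_devices('GPU'))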

1.1 Source Code

The official example code:

import sys

import tensorflow as tf

import horovod
import horovod.tensorflow.keras as hvd


def main():
    # Horovod: initialize Horovod.
    hvd.init()

    # Horovod: pin GPU to be used to process local rank (one GPU per process)
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    (mnist_images, mnist_labels), _ = \
        tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
         tf.cast(mnist_labels, tf.int64))
    )
    dataset = dataset.repeat().shuffle(10000).batch(128)

    mnist_model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
        tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    # Horovod: adjust learning rate based on number of GPUs.
    scaled_lr = 0.001 * hvd.size()
    opt = tf.optimizers.Adam(scaled_lr)

    # Horovod: add Horovod DistributedOptimizer.
    opt = hvd.DistributedOptimizer(
        opt, backward_passes_per_step=1, average_aggregated_gradients=True)

    # Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
    # uses hvd.DistributedOptimizer() to compute gradients.
    mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                        optimizer=opt,
                        metrics=['accuracy'],
                        experimental_run_tf_function=False)

    callbacks = [
        # Horovod: broadcast initial variable states from rank 0 to all other processes.
        # This is necessary to ensure consistent initialization of all workers when
        # training is started with random weights or restored from a checkpoint.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),

        # Horovod: average metrics among workers at the end of every epoch.
        #
        # Note: This callback must be in the list before the ReduceLROnPlateau,
        # TensorBoard or other metrics-based callbacks.
        hvd.callbacks.MetricAverageCallback(),

        # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
        # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
        # the first three epochs. See https://arxiv.org/abs/1706.02677 for details.
        hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1),
    ]

    # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
    if hvd.rank() == 0:
        callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

    # Horovod: write logs on worker 0.
    verbose = 1 if hvd.rank() == 0 else 0

    # Train the model.
    # Horovod: adjust number of steps based on number of GPUs.
    mnist_model.fit(dataset, steps_per_epoch=500 // hvd.size(), callbacks=callbacks, epochs=24, verbose=verbose)


if __name__ == '__main__':
    if len(sys.argv) == 4:
        # run training through horovod.run
        np = int(sys.argv[1])
        hosts = sys.argv[2]
        comm = sys.argv[3]
        print('Running training through horovod.run')
        horovod.run(main, np=np, hosts=hosts, use_gloo=comm == 'gloo', use_mpi=comm == 'mpi')
    else:
        # this is running via horovodrun
        main()
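
Since only worker 0 writes checkpoints, after training that worker's directory will contain checkpoint-1.h5 through checkpoint-24.h5. A minimal sketch for reloading one of them with the standard Keras API follows (the epoch number is illustrative; compile=False sidesteps deserializing the Horovod DistributedOptimizer wrapper):

import tensorflow as tf

# Load the architecture and weights only, skipping the optimizer state
model = tf.keras.models.load_model('./checkpoint-24.h5', compile=False)
model.summary()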

1.2 Running

horovodrun -np 2 -H localhost:2 python tensorflow2_keras_mnist.py
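
Note that the __main__ block of the script also supports launching through the horovod.run API instead of the horovodrun CLI, by passing three arguments (process count, host list, and 'gloo' or 'mpi'):

python tensorflow2_keras_mnist.py 2 localhost:2 gloo

Either way, with two processes hvd.size() is 2, so the learning rate is scaled to 0.001 * 2 = 0.002 and steps_per_epoch becomes 500 // 2 = 250.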

2 Multiple Machines, Multiple GPUs

2.0 Hardware Configuration

Likewise to keep testing costs down, a two-machine, two-GPU configuration is used. The IPs are 192.168.1.37 and 192.168.1.38.

2.1 Environment Configuration

  1. Log in to either node and establish SSH connectivity between the nodes.
(myconda) root@b0945000424c:/# ssh-keygen -t rsa # press Enter through every prompt to generate the RSA key pair
	Generating public/private rsa key pair.
	Enter file in which to save the key (/root/.ssh/id_rsa): 
	Enter passphrase (empty for no passphrase): 
	Enter same passphrase again: 
	Your identification has been saved in /root/.ssh/id_rsa.
	Your public key has been saved in /root/.ssh/id_rsa.pub.
	The key fingerprint is:
	SHA256:rCyXYDDQwFCxn+4SsIGB4vrCyBg9ZK20yXaNlxrZeqk root@b0945000424c
	The key's randomart image is:
	+---[RSA 2048]----+
	|B+o.             |
	|+o..             |
	|+.+.             |
	|+.++.. .         |
	|.X +== .S        |
	|+ Xo=o=o         |
	|=+ +o==.         |
	|+oo.ooo          |
	| . .Eo           |
	+----[SHA256]-----+

(myconda) root@b0945000424c:/# ssh-copy-id root@192.168.1.37 # distribute the public key to the other node; enter that node's password when prompted
	/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
	The authenticity of host '192.168.1.37 (192.168.1.37)' can't be established.
	ECDSA key fingerprint is SHA256:mBoJB3tizC3nKPNphS7AKrsWtjiRt31P2VPuNys+9y4.
	Are you sure you want to continue connecting (yes/no)? yes
	/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
	/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
	root@192.168.1.37's password: 
	
	Number of key(s) added: 1
	
	Now try logging into the machine, with:   "ssh 'root@192.168.1.37'"
	and check to make sure that only the key(s) you wanted were added.
  2. Check the network interfaces on both machines.
    (The relevant interfaces on these machines are meth811 and meth812.)
  3. Set the environment variables.
    These must be exported on both machines (a quick connectivity check follows this list):
(myconda) root@b0945000424c:/# export NCCL_SOCKET_IFNAME=meth811,meth812
(myconda) root@b0945000424c:/# export GLOO_IFACE=meth811,meth812
(myconda) root@b0945000424c:/# export NCCL_DEBUG=INFO # optional, for additional NCCL diagnostic output
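
Before launching the full training job, a tiny script such as the following can confirm that the two nodes reach each other and that every process sees its expected rank. This is a convenience sketch added here; check_hvd.py is a hypothetical filename, not part of the original post.

import horovod.tensorflow.keras as hvd

# Each process prints its global rank, world size, and local rank
hvd.init()
print('rank %d of %d, local rank %d' % (hvd.rank(), hvd.size(), hvd.local_rank()))

Launch it the same way as the training script:

horovodrun -np 2 -H 192.168.1.37:1,192.168.1.38:1 python check_hvd.py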

2.2 Running

  1. Normal run
    (myconda) root@b0945000424c:/mnt/Horovod# horovodrun -np 2 -H 192.168.1.37:1,192.168.1.38:1 --network-interface "192.168.1.37/24,192.168.1.38/24" python tensorflow2_keras_mnist.py
    Parameter notes:
    -np: the total number of processes to launch (in practice, the total number of GPUs).
    -H: how many processes each compute node runs, in IP:num_GPUs format, with nodes separated by commas. The local machine must be listed too; every node has to appear. For example, 192.168.1.37:1 means the node at 192.168.1.37 uses 1 GPU.
    --network-interface: specifies the networks the compute nodes communicate over; it must correspond to the nodes given to -H.

  2. For performance profiling, run as follows:
    (myconda) root@b0945000424c:/mnt/Horovod# horovodrun -np 1 --timeline-filename /path/to/timeline.json python tensorflow2_keras_mnist.py

    The timeline file can then be opened at chrome://tracing/ in Chrome for performance analysis.
    In this first attempt, however, the timeline file could not be opened: the earlier training run never passed the --timeline-filename argument, so no timeline file was actually generated.
    Checking the directory confirmed that the file did not exist, so the --timeline-filename argument has to be added to the horovodrun command used for testing.
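
    Putting steps 1 and 2 together, a multi-node profiling run would look something like the following (the output path is illustrative):

    horovodrun -np 2 -H 192.168.1.37:1,192.168.1.38:1 --timeline-filename /mnt/Horovod/timeline.json python tensorflow2_keras_mnist.py

    The resulting JSON file can then be loaded at chrome://tracing/ as described above.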
