Batch-launching worker/ps processes for distributed TensorFlow

__Sunny__

Category: TensorFlow

Article link: https://blog.csdn.net/s_sunnyy/article/details/79126738

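Summary (auto-generated by CSDN): This post describes a problem the author ran into while testing distributed TensorFlow: the worker and ps processes have to be started by hand on every machine. To get around this, the author wrote a shell script that issues the launch commands for all nodes from a single node, which avoids opening a large number of terminal windows. The script covers setting environment variables, using nohup so the commands do not block, and a companion script for stopping everything in one go.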

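I have recently been testing distributed TensorFlow, and one problem has been bothering me the whole time: the worker and ps processes have to be started by hand on each node. Judging from the related issues and answers on GitHub, for distributed_replicated mode it seems best to start one ps and one worker on every node. But that means a run on 32 machines takes 64 commands, which means opening 64 Xshell windows! (This is my understanding and I am not sure of it; if I have it wrong, please correct me.)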
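To make the "2 commands per node" arithmetic concrete, here is a minimal dry-run sketch that only prints the commands and does not start anything. It assumes, purely for illustration, that the hosts are numbered 10.0.0.1 through 10.0.0.N and that ps uses port 50000 and worker uses port 50001, as in the two-host example in this post. The helper names `host_list` and `gen_commands` are hypothetical and are not taken from the author's script.

```shell
#!/bin/sh
# Dry run: print one "host<TAB>command" line per process; nothing is executed.
# Assumptions (illustrative only): hosts are 10.0.0.1 .. 10.0.0.N,
# ps listens on port 50000 and worker on port 50001.

# host_list N PORT -> "10.0.0.1:PORT,10.0.0.2:PORT,..."
host_list() {
  hl_list=""
  hl_i=1
  while [ "$hl_i" -le "$1" ]; do
    # Prepend a comma only when the list already has an entry.
    hl_list="${hl_list}${hl_list:+,}10.0.0.${hl_i}:$2"
    hl_i=$((hl_i + 1))
  done
  printf '%s\n' "$hl_list"
}

# gen_commands N -> 2*N lines: one worker and one ps command per host.
gen_commands() {
  gc_n=$1
  gc_ps=$(host_list "$gc_n" 50000)
  gc_workers=$(host_list "$gc_n" 50001)
  gc_common="python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=${gc_ps} --worker_hosts=${gc_workers}"
  gc_i=0
  while [ "$gc_i" -lt "$gc_n" ]; do
    # task_index is 0-based; the host number is 1-based.
    printf '10.0.0.%d\t%s --job_name=worker --task_index=%d\n' \
      "$((gc_i + 1))" "$gc_common" "$gc_i"
    # The ps process hides the GPUs by setting CUDA_VISIBLE_DEVICES to empty.
    printf '10.0.0.%d\tCUDA_VISIBLE_DEVICES= %s --job_name=ps --task_index=%d\n' \
      "$((gc_i + 1))" "$gc_common" "$gc_i"
    gc_i=$((gc_i + 1))
  done
}

# Print the four commands for the two-host case.
gen_commands 2
```

Running `gen_commands 32 | wc -l` confirms the count of 64 commands for 32 machines. Each printed line could then be handed to whatever remote-execution tool is available; that step is deliberately left out of this sketch.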

The commands to run:

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--work