Running Horovod in distributed mode inside an Anaconda environment

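1. horovodrun did not work for me, so I had to launch with mpirun instead.
2. The next problem: my TensorFlow is installed inside an Anaconda environment and is not available outside it, so
the bare python in the command has to be replaced with the environment's own interpreter: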

/anaconda3/envs/figtensorflow/bin/python

The complete command that finally ran correctly is

 mpirun -np 2 -H server1:1,server2:1 -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/figtensorflow/bin/python ~/myshare/untitled/horovod_demo.py
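
To confirm that every rank really runs under the environment's interpreter and not the system python, a tiny script like the one below can be launched with the same mpirun line in place of the training script. This is only a minimal sketch; the function name describe_interpreter is hypothetical and not part of Horovod or Open MPI.

```python
import socket
import sys


def describe_interpreter():
    """Return the host name and the absolute path of the running interpreter."""
    return "%s: %s" % (socket.gethostname(), sys.executable)


if __name__ == "__main__":
    # Under mpirun each process prints one line; every path should point
    # into the conda environment (for example .../envs/figtensorflow/bin/python).
    print(describe_interpreter())
```

If any line shows a different interpreter, that host is not using the Anaconda environment.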

3. The problem I ran into

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now fail.

  Local host: ipoc345-PowerEdge-R630
  PID:        71270
  Message:    connect() to 172.17.0.1:1024 failed
  Error:      Operation now in progress (115)

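The following option is what solves this problem. It tells Open MPI explicitly which network (here given as a subnet) to use for TCP connections, so it no longer tries the 172.17.0.1 address shown in the warning: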

-mca btl_tcp_if_include 10.108.63.77/22
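
As a quick sanity check of what that subnet covers, Python's standard ipaddress module can be used. The sketch below only does address arithmetic; it does not query Open MPI. The helper name in_allowed_subnet is made up for this example. The two addresses tested are the one from the warning above and the address of the server the job is launched from later in this post.

```python
import ipaddress

# The value given to btl_tcp_if_include in the command above.
allowed = ipaddress.ip_interface("10.108.63.77/22").network


def in_allowed_subnet(address):
    """True if the address lies inside the subnet passed to btl_tcp_if_include."""
    return ipaddress.ip_address(address) in allowed


print(allowed)                             # 10.108.60.0/22
print(in_allowed_subnet("172.17.0.1"))     # False
print(in_allowed_subnet("10.108.61.249"))  # True
```

So the address that failed to connect lies outside the included subnet, which is consistent with the warning disappearing once the option is set.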

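But the code still did not run through to the end. It is probably a problem in the code itself; I will look at it again tomorrow.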

2020-05-23 00:17:25.728942: F tensorflow/core/common_runtime/scoped_allocator_mgr.cc:81] Failed to find instance 32550 in container 6 on /job:localhost/replica:0/task:0/device:CPU:0

----------

Today I found that small scripts like this one run without any problem

import os
import socket

#from keras import backend as K
import horovod.tensorflow as hvd
import tensorflow as tf

print("hey", socket.gethostname(), ":", os.getcwd())

hvd.init()
print('Hello, rank = %d, local_rank = %d, size = %d, local_size = %d' % (hvd.rank(), hvd.local_rank(), hvd.size(), hvd.local_size()))

The following command runs normally

mpirun -np 2 -H server1:1,server2:1 -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/figtensorflow/bin/python ~/myshare/horovod_test_hey.py

Output

Hello, rank = 0, local_rank = 0, size = 2, local_size = 1
Hello, rank = 1, local_rank = 0, size = 2, local_size = 1
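
To read these numbers: size is the total number of processes, local_size is the number of processes on the same host, and local_rank is the index of a process within its host. The sketch below reproduces the output above from the host list alone. It is a plain-Python illustration, not Horovod's or Open MPI's actual implementation, and it assumes that -np equals the total number of slots and that ranks are assigned host by host.

```python
def layout(hosts):
    """Assign ranks host by host, filling each host's slots in order.

    hosts is a list of (hostname, slots) pairs, for example parsed from
    "-H server1:1,server2:1". Returns one dict per process.
    """
    size = sum(slots for _, slots in hosts)
    procs = []
    rank = 0
    for host, slots in hosts:
        for local_rank in range(slots):
            procs.append({"host": host, "rank": rank, "local_rank": local_rank,
                          "size": size, "local_size": slots})
            rank += 1
    return procs


for p in layout([("server1", 1), ("server2", 1)]):
    print("Hello, rank = %d, local_rank = %d, size = %d, local_size = %d"
          % (p["rank"], p["local_rank"], p["size"], p["local_size"]))
```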

----------
Running this command

mpirun -np 2 -H server1:1,server2:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/figtensorflow/bin/python ~/myshare/untitled/horovod_demo.py

gives this error

2020-05-23 14:44:56.654525: F tensorflow/core/common_runtime/scoped_allocator_mgr.cc:81] Failed to find instance 32581 in container 6 on /job:localhost/replica:0/task:0/device:CPU:0

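I then asked about it on GitHub, and someone suggested upgrading Open MPI to 4.0.0.
Fine, let's try it; there is nothing to lose at this point.
First uninstall the previously installed MPI. If it had been installed with --prefix=/usr/local/openmpi, simply deleting the openmpi folder would be enough, but I had used /usr/local, so it has to be uninstalled as shown below.
Go to the directory the old version was built from, openmpi3.1.2, run the first two commands there, then download and build 4.0.0: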

./configure --prefix=/usr/local
make uninstall
curl -O -L https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar xvzf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0/
./configure --prefix=/usr/local/openmpi4.0.0
make -j8
sudo make install

Set up the environment

export PATH="/usr/local/openmpi4.0.0/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/openmpi4.0.0/lib"

Test

mpirun -np 2 -host server1,server2 hello_c

Error

bash: orted: command not found

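Reportedly this happens because, when mpirun logs in to the other host over ssh, that shell cannot find the Open MPI installation (orted is the helper daemon mpirun starts on each remote host).
Adding the --prefix option fixes it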

mpirun -np 2 -host server1,server2 --prefix /usr/local/openmpi4.0.0 hello_c

It now runs normally and prints

Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI ipoc345@PowerEdge-R220 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 117)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI ipoc345@ipoc345-PowerEdge-R630 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 125)
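
When debugging the "orted: command not found" error above, it can help to check on each host whether mpirun and orted are visible on the PATH. The following is a minimal sketch using only the Python standard library; the function name find_openmpi_tools is hypothetical, and the prefix in the usage lines is the install location used in this post.

```python
import os
import shutil


def find_openmpi_tools(prefix=None):
    """Look up mpirun and orted, optionally searching <prefix>/bin first."""
    search_path = os.environ.get("PATH", "")
    if prefix:
        search_path = os.path.join(prefix, "bin") + os.pathsep + search_path
    return {tool: shutil.which(tool, path=search_path)
            for tool in ("mpirun", "orted")}


if __name__ == "__main__":
    # A value of None means the tool was not found on the search path.
    print(find_openmpi_tools())
    print(find_openmpi_tools("/usr/local/openmpi4.0.0"))
```

If the first line shows None for orted on a remote host while the second finds it, the PATH seen by that shell does not include the Open MPI bin directory, which matches the situation where --prefix is needed.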

Open MPI is now at version 4.0.0, so I created a fresh environment

conda create -n tensorflowtry pip python=3.5
source activate tensorflowtry
pip install --ignore-installed --upgrade https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/6d/dc/464f59597a5a8282585238e6e3a7bb3770c3c1f1dc8ee72bd5be257178ec/tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl#sha256=d345d296aeb05eeb50d9de43a1dcb66ceaba6a2bd603f58aeefaa07b2c1bfac1
pip install numpy==1.14.0
pip install --no-cache-dir horovod

Test whether the installation succeeded

python
import tensorflow
import horovod.tensorflow as hvd

If there are no errors, the installation succeeded.
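
The same check can be scripted so that it can be run on every server. This is only a sketch: the helper name is_installed is made up here, and it only tests whether the top-level packages can be located. It does not prove that importing horovod.tensorflow will succeed, so the interactive import above remains the real test.

```python
import importlib.util


def is_installed(module_name):
    """True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("tensorflow", "horovod"):
        print(name, "found" if is_installed(name) else "MISSING")
```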
Then run the code. This was run from the server at 10.108.61.249

mpirun -np 2 -H server1:1,server2:1 --prefix /usr/local/openmpi4.0.0  -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/tensorflowtry/bin/python ~/myshare/untitled/horovod_demo.py

Output

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-05-27 06:08:34.539417: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-20012
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2020-05-27 18:08:23.039141: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2020-05-27 06:08:39.650584: W tensorflow/core/framework/allocator.cc:101] Allocation of 67108864 exceeds 10% of system memory.
INFO:tensorflow:Saving checkpoints for 20013 into ./checkpoints/model.ckpt.
INFO:tensorflow:loss = 0.01067956, step = 20013
INFO:tensorflow:loss = 3.298196e-06, step = 20013

There are no more errors, but I am not sure why so little is printed. It is probably something in the code.
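
Since the final mpirun line is long and easy to mistype, one option is to assemble it in a small launcher script. The sketch below builds the same argument list as the working command; the function name build_mpirun_command and its parameters are hypothetical, and the comments describe the flags as I understand them from the Open MPI and Horovod documentation.

```python
import os


def build_mpirun_command(python_path, script_path, hosts, subnet, prefix=None):
    """Assemble the mpirun argument list used in this post."""
    total_slots = sum(slots for _, slots in hosts)
    host_arg = ",".join("%s:%d" % (h, s) for h, s in hosts)
    cmd = ["mpirun", "-np", str(total_slots), "-H", host_arg]
    if prefix:
        cmd += ["--prefix", prefix]            # where Open MPI lives on remote hosts
    cmd += [
        "-bind-to", "none",                    # do not pin processes to cores
        "-map-by", "slot",                     # place ranks slot by slot
        "-x", "NCCL_DEBUG=INFO",               # export this variable to all ranks
        "-x", "LD_LIBRARY_PATH",               # forward the local value
        "-x", "PATH",
        "-mca", "pml", "ob1",
        "-mca", "btl", "^openib",              # exclude the openib transport
        "-mca", "btl_tcp_if_include", subnet,  # restrict TCP to this subnet
        python_path,
        os.path.expanduser(script_path),       # expand ~ as the shell would
    ]
    return cmd


if __name__ == "__main__":
    cmd = build_mpirun_command(
        "/home/ipoc345/anaconda3/envs/tensorflowtry/bin/python",
        "~/myshare/untitled/horovod_demo.py",
        [("server1", 1), ("server2", 1)],
        "10.108.63.77/22",
        prefix="/usr/local/openmpi4.0.0",
    )
    print(" ".join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd), which avoids shell quoting issues with arguments such as ^openib.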
