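1. horovodrun did not work for me, so I had to launch with mpirun instead.
2. The catch was that my tensorflow is installed inside an anaconda environment and is not available outside it, so
python has to be replaced with the environment's own interpreter: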
/anaconda3/envs/figtensorflow/bin/python
The complete command that finally ran correctly is
mpirun -np 2 -H server1:1,server2:1 -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/figtensorflow/bin/python ~/myshare/untitled/horovod_demo.py
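The same command can also be assembled from Python, which makes the interpreter path explicit. This is only a minimal sketch: build_mpirun_command is a hypothetical helper (not part of Horovod or Open MPI), and it assumes the script is started from inside the activated conda environment, so that sys.executable is that environment's python.

```python
import shlex
import sys


def build_mpirun_command(hosts, script, subnet, python=sys.executable):
    """Assemble an mpirun argument list with one process per host.

    hosts  -- host names as used with -H, e.g. ["server1", "server2"]
    script -- path of the training script
    subnet -- value for btl_tcp_if_include, e.g. "10.108.63.77/22"
    python -- absolute interpreter path (defaults to the current one)
    """
    host_spec = ",".join("%s:1" % h for h in hosts)
    return [
        "mpirun",
        "-np", str(len(hosts)),
        "-H", host_spec,
        "-mca", "btl_tcp_if_include", subnet,
        python,
        script,
    ]


if __name__ == "__main__":
    cmd = build_mpirun_command(
        ["server1", "server2"],
        "horovod_demo.py",
        "10.108.63.77/22",
    )
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The interpreter path is used unchanged on every host, so this only works if the environment lives at the same location on each of them.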
3. The problem I ran into
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now fail.
Local host: ipoc345-PowerEdge-R630
PID: 71270
Message: connect() to 172.17.0.1:1024 failed
Error: Operation now in progress (115)
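The option below is what fixes this: it tells Open MPI which network to use for its TCP connections, so it no longer tries the unreachable 172.17.0.1 address.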
-mca btl_tcp_if_include 10.108.63.77/22
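To see which addresses that CIDR value actually covers, Python's standard ipaddress module is enough. A small sketch; in_subnet is a name made up for this illustration, and the two addresses checked are the ones that appear in these notes.

```python
import ipaddress

# The value passed to btl_tcp_if_include. strict=False is needed because
# 10.108.63.77 is a host address, not the network's base address.
SUBNET = ipaddress.ip_network("10.108.63.77/22", strict=False)


def in_subnet(address, network=SUBNET):
    """Return True if the address belongs to the given network."""
    return ipaddress.ip_address(address) in network


if __name__ == "__main__":
    print(SUBNET)                      # the network the CIDR value denotes
    print(in_subnet("10.108.61.249"))  # a server address from these notes
    print(in_subnet("172.17.0.1"))     # the address from the warning
```

10.108.63.77/22 denotes the network 10.108.60.0/22, that is 10.108.60.0 through 10.108.63.255. The server address falls inside it, while the 172.17.0.1 address from the warning does not, which is why restricting Open MPI to this network avoids the failed connect().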
But the code still did not run through to the end. Probably a problem in the code itself; I'll look again tomorrow.
2020-05-23 00:17:25.728942: F tensorflow/core/common_runtime/scoped_allocator_mgr.cc:81] Failed to find instance 32550 in container 6 on /job:localhost/replica:0/task:0/device:CPU:0
Divider
Today I found that small scripts like this one run without any problem
import os
import socket
#from keras import backend as K
import horovod.tensorflow as hvd
import tensorflow as tf
print("hey", socket.gethostname(), ":", os.getcwd())
hvd.init()
print('Hello, rank = %d, local_rank = %d, size = %d, local_size = %d' % (hvd.rank(), hvd.local_rank(), hvd.size(), hvd.local_size()))
The following command runs fine
mpirun -np 2 -H server1:1,server2:1 -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/figtensorflow/bin/python ~/myshare/horovod_test_hey.py
Output
Hello, rank = 0, local_rank = 0, size = 2, local_size = 1
Hello, rank = 1, local_rank = 0, size = 2, local_size = 1
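With more than two processes it gets tedious to check by eye that every rank showed up. A rough sketch of an automated check, assuming the output has been captured into a string; all_ranks_reported is a hypothetical helper and the pattern matches only the exact line printed by the test script above.

```python
import re

# Matches the line printed by the test script above.
HELLO = re.compile(
    r"Hello, rank = (\d+), local_rank = (\d+), size = (\d+), local_size = (\d+)"
)


def all_ranks_reported(output):
    """Return True if every rank 0..size-1 printed exactly one hello line."""
    ranks = []
    sizes = set()
    for match in HELLO.finditer(output):
        ranks.append(int(match.group(1)))
        sizes.add(int(match.group(3)))
    if len(sizes) != 1:
        return False  # no hello lines, or ranks disagree about the job size
    size = sizes.pop()
    return sorted(ranks) == list(range(size))


if __name__ == "__main__":
    sample = (
        "Hello, rank = 0, local_rank = 0, size = 2, local_size = 1\n"
        "Hello, rank = 1, local_rank = 0, size = 2, local_size = 1\n"
    )
    print(all_ranks_reported(sample))
```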
Divider
Running this command
mpirun -np 2 -H server1:1,server2:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/figtensorflow/bin/python ~/myshare/untitled/horovod_demo.py
gives this error
2020-05-23 14:44:56.654525: F tensorflow/core/common_runtime/scoped_allocator_mgr.cc:81] Failed to find instance 32581 in container 6 on /job:localhost/replica:0/task:0/device:CPU:0
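Then I asked on GitHub, and someone suggested upgrading openmpi to 4.0.0.
Fine, let's try it; there is nothing to lose at this point.
First remove the previously installed MPI. Had it been configured with --prefix=/usr/local/openmpi, deleting the openmpi folder would be enough, but I had used /usr/local, so it has to be uninstalled like this:
Go to the directory the old version was installed from, openmpi3.1.2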
./configure --prefix=/usr/local
make uninstall
Then download, build and install 4.0.0
curl -O -L https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar xvzf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0/
./configure --prefix=/usr/local/openmpi4.0.0
make -j8
sudo make install
Set up the environment
export PATH="/usr/local/openmpi4.0.0/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/openmpi4.0.0/lib"
Test
mpirun -np 2 -host server1,server2 hello_c
Error
bash: orted: command not found
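Reportedly this happens because, once mpirun has logged in to the other host over ssh, Open MPI cannot be found there.
Adding the --prefix option fixes it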
mpirun -np 2 -host server1,server2 --prefix /usr/local/openmpi4.0.0 hello_c
It now runs normally and prints
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI ipoc345@PowerEdge-R220 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 117)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI ipoc345@ipoc345-PowerEdge-R630 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 125)
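The --prefix option makes mpirun set PATH and LD_LIBRARY_PATH on the remote node before it starts orted there, so the install has to exist under that prefix on every host. A quick sketch for checking that, meant to be run on each host in turn; missing_openmpi_files is a hypothetical helper and the prefix is the one from the configure step in these notes.

```python
import os


def missing_openmpi_files(prefix, names=("mpirun", "orted")):
    """List the executables from names that are absent from <prefix>/bin."""
    bin_dir = os.path.join(prefix, "bin")
    return [
        name for name in names
        if not os.access(os.path.join(bin_dir, name), os.X_OK)
    ]


if __name__ == "__main__":
    # Prefix taken from the configure step in these notes.
    missing = missing_openmpi_files("/usr/local/openmpi4.0.0")
    print("missing:", missing if missing else "none")
```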
Open MPI is now at version 4.0.0, so I created a fresh environment
conda create -n tensorflowtry pip python=3.5
source activate tensorflowtry
pip install --ignore-installed --upgrade https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/6d/dc/464f59597a5a8282585238e6e3a7bb3770c3c1f1dc8ee72bd5be257178ec/tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl#sha256=d345d296aeb05eeb50d9de43a1dcb66ceaba6a2bd603f58aeefaa07b2c1bfac1
pip install numpy==1.14.0
pip install --no-cache-dir horovod
Test whether the installation succeeded
python
import tensorflow
import horovod.tensorflow as hvd
If there are no errors, the installation succeeded
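For a non-interactive version of this check, something like the sketch below can be run with the environment's python. The helper name installed is made up for this example. Note that find_spec only locates a package without importing it, so it will not catch a package that is present but broken; the interactive import above remains the real test.

```python
import importlib.util


def installed(module_names):
    """Map each top-level module name to whether it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, found in installed(["numpy", "tensorflow", "horovod"]).items():
        print("%-12s %s" % (name, "found" if found else "MISSING"))
```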
Then run the code; this was run on the server at 10.108.61.249
mpirun -np 2 -H server1:1,server2:1 --prefix /usr/local/openmpi4.0.0 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.108.63.77/22 /home/ipoc345/anaconda3/envs/tensorflowtry/bin/python ~/myshare/untitled/horovod_demo.py
Output
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-05-27 06:08:34.539417: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-20012
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2020-05-27 18:08:23.039141: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2020-05-27 06:08:39.650584: W tensorflow/core/framework/allocator.cc:101] Allocation of 67108864 exceeds 10% of system memory.
INFO:tensorflow:Saving checkpoints for 20013 into ./checkpoints/model.ckpt.
INFO:tensorflow:loss = 0.01067956, step = 20013
INFO:tensorflow:loss = 3.298196e-06, step = 20013
No more errors, but for some reason very little gets printed. Probably something in the code.