We have 20 Raspberry Pis running Ubuntu 20.04 (arm64). After plugging them into the lab's wired network, however, only 16 of them could be found on the subnet and logged into. So the question: how do we locate the other 4? One option is to unplug the 20 boards one by one, scanning the network's hosts after each removal to see which IP drops offline; a dropped IP then identifies the board. But note that the lab assigns IPs dynamically: an IP is determined jointly by the network cable (port) and the host's MAC address, and if either one changes, the IP changes too. (At least that was the behavior when I tested before the New Year break.)
In short, that approach is far too tedious, and by 3 a.m. I had lost all interest in pursuing it. After the break, someone in the lab unplugged some of the Pis' network or power cables, and a network scan now turns up only 12!
Fine.
On to the experiment.
- With the gRPC communication backend, you only need to publish the code to every node (or use NFS, introduced below), but then every node (annoying) has to install the dependencies, every node (still annoying) has to be launched in turn, the server and client code differ, and the launch arguments differ too. In short, RPC is a hassle, even though it closely mirrors a real production environment.
- So I went with the MPI communication backend instead, which is dead simple. The tutorial follows:
One disclaimer up front: run the commands below with your brain switched on; most of them need adapting to your setup, e.g. server_hostname, username, the worker names, and so on. Also note that when configuring NFS in /etc/exports, you must grant every node write permission (rw); otherwise the later per-node installation steps will not go through.
Prerequisites
Configure the hostname and hosts file on every node of your cluster for convenience. Taking Ubuntu as an example:
- hostname: /etc/hostname
- hostfile: /etc/hosts. Add every node's name and its corresponding IP
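A hypothetical /etc/hosts fragment to illustrate the layout (the IP addresses are placeholders; substitute the actual ones on your lab's subnet):

```text
192.168.1.12  worker12
192.168.1.11  worker11
192.168.1.10  worker10
192.168.1.9   worker9
```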
1. ssh
The primary node distributes its public key to every other node, where it is appended to authorized_keys:
ssh-copy-id -i ~/.ssh/id_rsa.pub worker
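With many workers this is worth looping. A sketch, assuming the workers are named worker2 through worker12 as later in this post, the remote user is `ubuntu`, and a key pair already exists (create one first with `ssh-keygen` if not):

```shell
# Copy the primary node's public key to every worker in one go.
# You will be prompted for each worker's password once; afterwards
# ssh (and hence mpiexec) can log in without a password.
for i in $(seq 2 12); do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "ubuntu@worker${i}"
done
```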
2. NFS
2.1 NFS-Server
sudo apt-get install nfs-kernel-server
mkdir cloud
# sudo vim /etc/exports
/home/username/cloud worker11(rw,sync,no_subtree_check) worker10(rw,sync,no_subtree_check) worker9(rw,sync,no_subtree_check)
sudo exportfs -a
Run the above command every time you make a change to /etc/exports. If required, restart the NFS server:
sudo service nfs-kernel-server restart
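Before moving on to the workers, it can save debugging time to double-check what the server is actually exporting, and with which options:

```shell
# Print every active export together with its options; each worker listed
# in /etc/exports should appear here with rw among the options.
sudo exportfs -v
```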
2.2 NFS-Worker
sudo apt-get install -y nfs-common
mkdir cloud
Now mount the shared directory, for example:
#sudo mount -t nfs server_hostname:/home/username/cloud ~/cloud
sudo mount -t nfs worker12:/home/ubuntu/cloud ~/cloud
To check the mounted directories, run:
df -h
To make the mount permanent, so you don't have to mount the shared directory manually after every reboot, you can create an entry in your file systems table, i.e. the /etc/fstab file, like this:
$ cat /etc/fstab
#MPI CLUSTER SETUP
worker12:/home/ubuntu/cloud /home/ubuntu/cloud nfs defaults 0 0
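After adding the fstab entry you can verify it without rebooting; `mount -a` mounts everything listed in /etc/fstab that isn't mounted yet, so an error here means the new line has a typo:

```shell
# Re-read /etc/fstab and mount any missing entries, then confirm the
# shared directory shows up as an NFS mount.
sudo mount -a
df -h ~/cloud
```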
3. Run
mpiexec -n 4 --host server:2,worker1:3,worker2:3 python run.py
In practice I ran it like this, because on Raspberry Pi Ubuntu the interpreter is python3, not python:
mpiexec -n 7 --host worker12:1,worker11:2,worker10:2,worker9:2 python3 run.py
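Before launching the real job, a quick sanity check (a sketch using the same host list as above) is to have every rank run `hostname`; any host missing from the output points to an SSH or hostfile problem rather than a problem with your Python code:

```shell
# Each of the 7 ranks runs `hostname`, so worker12 should appear once
# and worker11/worker10/worker9 twice each in the output.
mpiexec -n 7 --host worker12:1,worker11:2,worker10:2,worker9:2 hostname
```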
And here the next problem appeared. Each Raspberry Pi has 4 GB of RAM and 4 logical CPU cores.
Why does execution always fail with the following error?
With 11 nodes:
ubuntu@worker12:~/cloud/APPFL/examples$ mpiexec -n 11 --host worker12:1,worker11:1,worker10:1,worker9:1,worker8:1,worker7:1,worker6:1,worker5:1,worker4:1,worker3:1,worker2:1 python3 cifar10.py
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 10 with PID 95259 on node worker2 exited on signal 9 (Killed).
--------------------------------------------------------------------------
ubuntu@worker12:~/cloud/APPFL/examples$ htop
With 10 nodes:
ubuntu@worker12:~/cloud/APPFL/examples$ mpiexec -n 10 --host worker12:1,worker11:1,worker10:1,worker9:1,worker8:1,worker7:1,worker6:1,worker5:1,worker4:1,worker3:1 python3 cifar10.py
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 5 with PID 54279 on node worker7 exited on signal 9 (Killed).
--------------------------------------------------------------------------
ubuntu@worker12:~/cloud/APPFL/examples$
How do I fix this?
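A hint worth checking first: signal 9 usually means the kernel's OOM killer terminated the process, which would fit 4 GB Pis loading CIFAR-10. One way to test this hypothesis is to inspect the kernel log on the worker whose rank died (worker2 and worker7 in the runs above):

```shell
# Look for OOM-killer entries around the time of the crash; a line like
# "Killed process ... (python3)" confirms the job ran out of memory.
dmesg | grep -i -E 'out of memory|killed process' | tail -n 5
```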