Problems encountered while testing Open MPI
When I ran mpiexec -hostfile ~/machines hostname
to check that the configuration was correct, I got the following error:
fkuner@master:/usr/local$ mpiexec -hostfile ~/machines hostname
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[17075,0],0] on node master
Remote daemon: [[17075,0],3] on node data3
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
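Solution: I checked for a long time without finding the cause. Eventually I happened to open ~/.ssh/authorized_keys on the master node and found that it still contained the master node's own public key. Suspecting this might be the cause, I deleted that entry, and the command then succeeded.
Open the file with: vim ~/.ssh/authorized_keys
and delete that entry.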
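The same edit can be made without opening an editor. The following is a minimal sketch, not the method used above: remove_key is a hypothetical helper, and the key file ~/.ssh/id_rsa.pub in the usage note is an assumption about where the master's public key lives. It keeps a .bak copy so the change can be undone.

```shell
#!/bin/sh
# Hypothetical helper: drop every line of an authorized_keys file that
# contains the given key text, after saving a backup next to it.
remove_key() {
    keys_file=$1   # e.g. ~/.ssh/authorized_keys
    pub_key=$2     # fixed string to match, e.g. the base64 body of the key
    cp "$keys_file" "$keys_file.bak" || return 1
    # -v keeps the non-matching lines, -F treats the key as a literal string.
    grep -vF -- "$pub_key" "$keys_file.bak" > "$keys_file"
    # grep exits 1 when nothing is left to print; only 2 or more is a real error.
    [ $? -le 1 ]
}
```

Possible usage on the master node, matching on the key body (second field of the .pub file): remove_key ~/.ssh/authorized_keys "$(awk '{print $2}' ~/.ssh/id_rsa.pub)". To undo, copy authorized_keys.bak back over authorized_keys.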
The correct output should be:
fkuner@master:~/.ssh$ mpiexec -hostfile ~/machines hostname
master
master
data1
data3
data2
The number of times master is printed equals the number of CPU cores on that node.
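If the per-node process count should not follow the detected core count, an Open MPI hostfile can state it explicitly with slots=N. The sketch below is only an illustration: the contents of ~/machines are not shown in this note, so the file path and slot values here are assumptions, and total_slots is a hypothetical helper.

```shell
#!/bin/sh
# Write an example hostfile (hypothetical path and slot counts).
cat > /tmp/machines.example <<'EOF'
master slots=1
data1 slots=1
data2 slots=1
data3 slots=1
EOF

# Sum the slots=N values declared in a hostfile.
# Lines without an explicit slots= entry are not counted.
total_slots() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); n += $i }
    } END { print n + 0 }' "$1"
}

total_slots /tmp/machines.example
```

With a file like this, mpiexec -hostfile /tmp/machines.example hostname would be expected to print each host name once, since one process is started per declared slot when no process count is given.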
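Later, after I gave every virtual machine 2 cores, the error above came back. Removing the data3 node from the hostfile made it work again; I don't know why.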
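Since the error message points at the TCP connection to a specific node, one way to narrow this down is to test passwordless ssh to every host in the hostfile before running mpiexec. This is a rough sketch under assumptions: list_hosts and check_hosts are hypothetical helpers, and the hostfile is assumed to have one host name at the start of each line.

```shell
#!/bin/sh
# Print the host name (first field) of every non-blank, non-comment line.
list_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Try a non-interactive ssh login to each host and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_hosts() {
    for host in $(list_hosts "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: unreachable or login failed"
        fi
    done
}
```

Running check_hosts ~/machines on the master node shows which hosts accept a key-based login within 5 seconds. A node that fails here is a likely candidate for the lost-daemon error, although a node that passes can still fail for other reasons.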
Problems encountered while testing pagerank
fkuner@master:~/graphlab/release/toolkits/graph_analytics$ ./pagerank --powerlaw=10000
GRAPHLAB_SUBNET_ID/GRAPHLAB_SUBNET_MASK environment variables not defined.
Using default values
Subnet ID: 0.0.0.0
Subnet Mask: 0.0.0.0
Will find first IPv4 non-loopback address matching the subnet
ERROR: fiber_control.cpp(launch:270): Check failed: b<nworkers [1 < 1]
[master:06985] *** Process received signal ***
[master:06985] Signal: Aborted (6)
[master:06985] Signal code: (-6)
[master:06985] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f709c8d2890]
[master:06985] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f709a0e2e97]
[master:06985] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f709a0e4801]
[master:06985] [ 3] ./pagerank(_ZN8graphlab13fiber_control6launchEN5boost8functionIFvvEEEmNS_18fixed_dense_bitsetILi64EEE+0x284)[0x5638db46c054]
[master:06985] [ 4] ./pagerank(_ZN8graphlab11fiber_group6launchERKN5boost8functionIFvvEEENS_18fixed_dense_bitsetILi64EEE+0x300)[0x5638db46e940]
[master:06985] [ 5] ./pagerank(_ZN8graphlab19distributed_controlC1Ev+0x772)[0x5638db497a62]
[master:06985] [ 6] ./pagerank(main+0x91)[0x5638db39aae1]
[master:06985] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f709a0c5b97]
[master:06985] [ 8] ./pagerank(_start+0x2a)[0x5638db39d41a]
[master:06985] *** End of error message ***
Aborted (core dumped)
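Solution: the cause is that GraphLab requires a multi-core CPU (the failed check "b<nworkers [1 < 1]" appears when only one core is visible), so configuring the virtual machine with more than one CPU in VirtualBox fixes it.
References: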
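A quick check inside the virtual machine can confirm how many cores are visible before running pagerank. This is a minimal sketch: has_cores is a hypothetical helper, and the VM name "master" in the comment is an assumption.

```shell
#!/bin/sh
# Succeed only if at least $1 processing units are available (default 2).
has_cores() {
    need=${1:-2}
    have=$(nproc)
    [ "$have" -ge "$need" ]
}

if has_cores 2; then
    echo "ok: $(nproc) cores visible"
else
    echo "only $(nproc) core visible; raise the VM's CPU count"
fi

# On the VirtualBox host, with the VM powered off, the CPU count can also be
# set from the command line instead of the GUI (VM name is an assumption):
#   VBoxManage modifyvm "master" --cpus 2
```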
https://www.cnblogs.com/jasonkoo/p/3257517.html
https://www.jianshu.com/p/af02bb459ff4
https://www.cnblogs.com/leijin0211/p/6851675.html