Today's Topics:
2. Re: mpdboot problem:unable to ping local mpd (Dave Goodell)
----------------------------------------------------------------------
Message: 2
Date: Wed, 29 Sep 2010 08:05:51 -0500
From: Dave Goodell <goodell@mcs.anl.gov>
Subject: Re: [mpich-discuss] mpdboot problem:unable to ping local mpd
To: mpich-discuss@mcs.anl.gov
Message-ID: <883B6895-DA80-43CD-8C06-F9B6A149A0EF@mcs.anl.gov>
Content-Type: text/plain; charset=us-ascii
Running mpd as root is tricky. You shouldn't do it unless you really need to and really know what you are doing with it.
Better yet, just don't use mpd at all. Use hydra instead; it's much more robust: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
-Dave
On Sep 29, 2010, at 6:47 AM CDT, Albert wrote:
> I have a problem with MPICH2 on a Lenovo cluster when I start more than three nodes.
>
> The error info is as follows.
> Could anyone give me some advice? Thanks.
>
> Albert
>
> [root@c0107 ~]# mpdboot -n 2 -f mpd.hosts
> mpdboot_c0107_0 (mpdboot 393): error trying to start mpd(boot) at 1 {'host': 'c0104', 'ncpus': 1, 'ifhn': ''}; output:
> mpdboot_c0104_1 (err_exit 415): mpd failed to start correctly on c0104
> reason: 1: unable to ping local mpd;
> invalid msg from mpd :{}:
> ** mpd may have disappeared, perhaps due to mismatched secretwords
> ** see msgs logged in syslog and /tmp/mpd2.logfile* on c0104
> last printed output from mpd before becoming a daemon:
> 37857
>
> mpdboot_c0104_1 (err_exit 421): contents of mpd logfile in /tmp:
> logfile for mpd with pid 32501
> c0104_37857: conn error in connect_lhs: No route to host
> c0104_37857 (connect_lhs 542): failed to connect to lhs at c0107 46288
> c0104_37857 (enter_ring 500): lhs connect failed
> c0104_37857 (run 215): failed to enter ring
> mpdboot_c0107_0 (err_exit 415): mpd failed to start correctly on c0107
> [root@c0107 ~]# ssh c0104
> Last login: Wed Sep 29 19:29:06 2010 from console
> [root@c0104 ~]# mpdboot -n 2 -f mpd.hosts
> [root@c0104 ~]# mpdtrace
> c0104
> c0107
> [root@c0104 ~]# mpdboot -n 3 -f mpd.hosts
> mpdboot_c0104_0 (mpdboot 406): error trying to start mpd(boot) at 2 {'host': 'c0108', 'ncpus': 1, 'ifhn': ''}; output:
> mpdboot_c0108_2 (err_exit 415): mpd failed to start correctly on c0108
> reason: 2: unable to ping local mpd;
> invalid msg from mpd :{}:
> ** mpd may have disappeared, perhaps due to mismatched secretwords
> ** see msgs logged in syslog and /tmp/mpd2.logfile* on c0108
> last printed output from mpd before becoming a daemon:
> 41819
>
> mpdboot_c0108_2 (err_exit 421): contents of mpd logfile in /tmp:
> logfile for mpd with pid 4894
> c0108_41819: conn error in connect_rhs: No route to host
> c0108_41819 (connect_rhs 602): failed to connect to rhs at 192.168.1.7 49518
> c0108_41819 (enter_ring 513): rhs connect failed
> c0108_41819 (run 215): failed to enter ring
> mpdboot_c0104_0 (err_exit 415): mpd failed to start correctly on c0104
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss@mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
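The "No route to host" lines in the quoted mpd logfiles mean that the new mpd could not open a TCP connection to the port its ring neighbour was listening on (c0107 port 46288 in the first run, 192.168.1.7 port 49518 in the second). Nothing in the thread confirms the cause. A host firewall rejecting connections to mpd's dynamically chosen ports is one common possibility, but that is an assumption. The following is a minimal sketch, not part of mpd, for testing from one node whether a given host and port accept TCP connections. The helper name check_tcp and its return strings are hypothetical.

```python
import errno
import socket


def check_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection to host:port.

    Returns a tuple (ok, reason). This helper is illustrative only
    and is not part of mpd or MPICH2.
    """
    try:
        # create_connection resolves the name and connects in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all: packets may be silently dropped
        return False, "timed out"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # The same error mpd logged: routing problem or a
            # firewall rule that rejects the connection
            return False, "no route to host"
        if exc.errno == errno.ECONNREFUSED:
            # Host is reachable but nothing listens on that port
            return False, "connection refused"
        return False, "error: %s" % exc
```

Run on c0104, a call such as check_tcp("c0107", 46288) would only be meaningful while an mpd on c0107 is still listening on that port. mpd picks a new port each time it starts, so the port number has to come from the current run (for example from the mpd logfile). If the result is "no route to host" while a plain ping of the same host succeeds, the firewall configuration on the target node is a reasonable next thing to check.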