node1:station110.example.com
node2:station101.example.com
node3:station201.example.com

1. Set up mutual SSH trust between all nodes so they can log in to each other without a password
    ssh-keygen
    ssh-copy-id -i ~/.ssh/id_rsa.pub station101
    ... ...

2. Install MPICH2
    ftp://ftp.mcs.anl.gov/pub/mpi/mpich2-1.2.tar.gz

2.1 Install on every node:
    tar xf mpich2-1.2.tar.gz
    cd mpich2-1.2
    ./configure --prefix=/usr/local/mpich2
    make
    make install

2.2 Update the environment variables
    vi ~/.bash_profile
        PATH=$PATH:$HOME/bin:/usr/local/mpich2/bin
    source ~/.bash_profile

2.3 Create the /etc/mpd.conf configuration file
    vi /etc/mpd.conf
        MPD_SECRETWORD=xnlinux
    chmod 600 /etc/mpd.conf

2.4 Create the node file
    vi /root/mpd.hosts
        station101
        station201

3. Testing
Single-node test. Start mpich2:
    mpd &
List the running machines:
    mpdtrace
    station101
Shut down:
    mpdallexit

Cluster test. Start:
    mpdboot -n 2 -f mpd.hosts    # -n 2 starts 2 nodes
    mpdtrace
    station101
    station201
    mpdallexit

Pi calculation test, using icpi.c from the examples directory of the mpich2 source package:
    mpicc icpi.c -o icpi
Single-machine test:
    ./icpi
Cluster test. First scp icpi to the same directory, /root/, on every node:
    mpdboot -n 2 -f mpd.hosts
    mpiexec -n 2 /root/icpi

4. Troubleshooting
    mpdcheck -pc
    mpdcheck -l
    mpdcheck -f mpd.hosts
On a node:
    mpdcheck -s

--------------------------------------------------------
Install torque. Torque is the cluster's job scheduling (batch) system.
    wget http://www.clusterresources.com/ ... torque-2.4.8.tar.gz

1. On all nodes:
    vi /etc/hosts
        192.168.0.110 station110.example.com
        192.168.0.101 station101.example.com
        192.168.0.201 station201.example.com

2. On the server node:
    vi /etc/hosts.equiv
        station101.example.com
        station201.example.com

3. On the compute nodes:
    vi /etc/hosts.equiv
        station110.example.com

4. Install rsh-server on all nodes so they can communicate. SSH keys can be used instead and are more secure.
    yum install rsh-server -y
    chkconfig rsh on
    chkconfig rlogin on
    chkconfig rexec on
    /etc/init.d/xinetd restart

5. On the server node:
    tar xf torque-2.4.8.tar.gz
    cd torque-2.4.8
    ./configure --with-rcp=rcp --with-default-server=station110.example.com
    (with SSH keys, use --with-rcp=scp)
    make
    make install     # the torque configuration directory is /var/spool/torque
    make packages    # builds install packages for the compute nodes; usable when compute nodes and the server node share the same architecture
    cp contrib/init.d/pbs_server /etc/init.d/               # needed on the server node
    scp contrib/init.d/pbs_mom station101:/etc/init.d/      # needed on the compute nodes
    scp contrib/init.d/pbs_mom station201:/etc/init.d/
    scp torque-package-clients-linux-i686.sh torque-package-mom-linux-i686.sh station101:/root
    scp torque-package-clients-linux-i686.sh torque-package-mom-linux-i686.sh station201:/root
    ./torque.setup root    # set the torque administrator account
    vi /var/spool/torque/server_priv/nodes    # define the compute nodes
        station101.example.com
        station201.example.com
    qterm -t quick                # stop torque
    /etc/init.d/pbs_server start  # start torque

6. Install maui. Maui is a job scheduler that is more capable than the one bundled with torque.
    tar xf maui-3.3.tar.gz
    cd maui-3.3
    ./configure --with-pbs
    make
    make install
    cp contrib/service-scripts/redhat.maui.d /etc/init.d/maui
    chmod +x /etc/init.d/maui
    vi /etc/init.d/maui
        MAUI_PREFIX=/usr/local/maui
        daemon $MAUI_PREFIX/sbin/maui
    /etc/init.d/maui start

7. Install torque on the compute nodes
    ./torque-package-mom-linux-i686.sh --install
    ./torque-package-clients-linux-i686.sh --install
If the nodes have different architectures, build from source on each node whose architecture differs:
    tar xf torque-2.4.8.tar.gz
    cd torque-2.4.8
    ./configure --with-rcp=rcp --with-default-server=station110.example.com
    make
    make install_mom install_clients
Apply the following configuration on every compute node:
    vi /var/spool/torque/mom_priv/config
        $pbsserver station110.example.com
        $logevent 255
    /etc/init.d/pbs_mom start

Testing:
Torque does not allow the root user to submit jobs. Create an account with the same user name and uid on every node, and copy the pi test program icpi into that user's home directory on each node.
    su - gao
    vi job1.pbs    # serial job
-----------------------------------------------
#!/bin/bash
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
cd /home/gao
echo running on hosts `hostname`
echo time is `date`
echo directory is $PWD
echo this job runs on the following nodes:
cat $PBS_NODEFILE
echo this job has allocated 1 node
./prog
-----------------------------------------------
    vi job2.pbs    # parallel job
-----------------------------------------------
#!/bin/bash
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
#PBS -l nodes=2
cd /home/gao
echo running on hosts `hostname`
echo time is `date`
echo directory is $PWD
echo job runs on the nodes:
cat $PBS_NODEFILE
NPROCS=`wc -l < $PBS_NODEFILE`
echo this job has allocated $NPROCS nodes
mpiexec -machinefile $PBS_NODEFILE -np $NPROCS ./prog
-----------------------------------------------
    vi prog
-----------------------------------------------
#!/bin/bash
echo 1000000|./icpi
-----------------------------------------------
Job scheduling:
    qsub job1.pbs    # submit a job
    qstat            # show job status
    pbsnodes         # show node status
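The NPROCS line in job2.pbs counts the lines of $PBS_NODEFILE. As a rough sketch of what that number means: the node file normally holds one line per allocated processor slot, so the line count can exceed the number of distinct hosts once a job requests several slots per node. The helper names count_slots and count_hosts and the sample file below are hypothetical, used only so the sketch runs outside a torque job.

```shell
#!/bin/bash
# Count process slots: one line per allocated slot in the node file.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Count distinct hosts: identical host names collapse to one entry.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Hypothetical stand-in for $PBS_NODEFILE (assumed: 2 hosts, 2 slots each).
sample_nodefile=$(mktemp)
printf '%s\n' station101 station101 station201 station201 > "$sample_nodefile"

echo "slots: $(count_slots "$sample_nodefile")"
echo "hosts: $(count_hosts "$sample_nodefile")"
```

Inside a real job the same functions would be called with "$PBS_NODEFILE". With the plain nodes=2 request used in job2.pbs both counts would be expected to equal 2, which is why that script can report NPROCS as the number of nodes.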
Building a parallel cluster with mpich2 + torque + maui