node2:station101.example.com
node3:station201.example.com
1. Set up mutual SSH trust between all nodes so that they can log in to one another without a password
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub station101
... ...
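Repeating ssh-copy-id for every node can be scripted. The following is a minimal sketch, assuming the two host names listed above and the default key path ~/.ssh/id_rsa.pub. The push_key function and the DRY_RUN switch are hypothetical helpers, not part of OpenSSH. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
#!/bin/bash
# Assumed node list, taken from the host names at the top of this document.
NODES="station101 station201"

# push_key: hypothetical helper that copies the public key to every node.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them.
push_key() {
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh-copy-id -i $HOME/.ssh/id_rsa.pub $node"
        else
            ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$node"
        fi
    done
}

push_key
```

Run it once on each node (with DRY_RUN=0) so that trust is established in both directions.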
2. Install MPICH2
ftp://ftp.mcs.anl.gov/pub/mpi/mpich2-1.2.tar.gz
2.1 Install on every node:
tar xf mpich2-1.2.tar.gz
cd mpich2-1.2
./configure --prefix=/usr/local/mpich2
make
make install
2.2 Update the environment variables
vi ~/.bash_profile
PATH=$PATH:$HOME/bin:/usr/local/mpich2/bin
source ~/.bash_profile
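If the profile is sourced more than once, a plain append adds the directory to PATH repeatedly. A minimal sketch of a guarded append, assuming the install prefix /usr/local/mpich2 used above (the variable name MPICH2_BIN is made up for this example):

```shell
#!/bin/bash
# Assumed install location, from --prefix=/usr/local/mpich2 above.
MPICH2_BIN=/usr/local/mpich2/bin

# Append only when the directory is not already a PATH component.
case ":$PATH:" in
    *":$MPICH2_BIN:"*) ;;               # already present, nothing to do
    *) PATH="$PATH:$MPICH2_BIN" ;;
esac
export PATH
```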
2.3 Create the /etc/mpd.conf configuration file
vi /etc/mpd.conf
MPD_SECRETWORD=xnlinux
chmod 600 /etc/mpd.conf
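The same file can be created non-interactively. A minimal sketch, assuming the secret word from above. write_mpd_conf is a hypothetical helper, and the demo writes to a scratch file rather than /etc/mpd.conf so it can be tried without root. The umask makes sure the file is never readable by other users, even briefly.

```shell
#!/bin/bash
# write_mpd_conf FILE SECRET: hypothetical helper that writes the file
# with owner-only permissions from the moment it is created.
write_mpd_conf() {
    (
        umask 077
        printf 'MPD_SECRETWORD=%s\n' "$2" > "$1"
    )
    chmod 600 "$1"
}

# Demo on a scratch file; on a real node pass /etc/mpd.conf instead.
mpd_conf_demo=$(mktemp)
write_mpd_conf "$mpd_conf_demo" xnlinux
ls -l "$mpd_conf_demo"
```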
2.4 Create the node file
vi /root/mpd.hosts
station101
station201
3. Testing
Single-node test, start mpd:
mpd &
List the hosts in the ring:
mpdtrace
station101
Shut down:
mpdallexit
Cluster test, start the ring:
mpdboot -n 2 -f mpd.hosts # -n 2 starts mpd on 2 nodes
mpdtrace
station101
station201
mpdallexit
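In the run above, the value passed to -n matches the number of lines in mpd.hosts. A minimal sketch that derives it from the file instead of hard-coding it, assuming one host per line. It writes a scratch copy of the host file and only prints the mpdboot command rather than running it.

```shell
#!/bin/bash
# Scratch copy of the node file from section 2.4.
hosts_file=$(mktemp)
printf '%s\n' station101 station201 > "$hosts_file"

# Count non-empty lines, i.e. the number of hosts listed.
nhosts=$(grep -c . "$hosts_file")

# Print the command; remove the echo to actually start the ring.
echo mpdboot -n "$nhosts" -f "$hosts_file"
```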
Pi calculation test:
Use icpi.c from the examples directory of the MPICH2 source package:
mpicc icpi.c -o icpi
Single-machine test:
./icpi
Cluster test: copy icpi with scp to the same directory, /root/, on every node:
mpdboot -n 2 -f mpd.hosts
mpiexec -n 2 /root/icpi
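For reference when checking the output: icpi is understood here to approximate pi by numerically integrating 4/(1+x^2) over [0,1] with the midpoint rule, with the intervals shared among the MPI processes. A minimal serial sketch of the same arithmetic in shell and awk, assuming n = 1000 intervals (the variable names are made up, and this is not the icpi source):

```shell
#!/bin/bash
# Number of intervals; icpi reads this value from standard input.
n=1000

# Midpoint rule: pi ~= h * sum of 4 / (1 + x_i^2), with x_i = h * (i - 0.5)
pi_est=$(awk -v n="$n" 'BEGIN {
    h = 1.0 / n
    sum = 0
    for (i = 1; i <= n; i++) {
        x = h * (i - 0.5)
        sum += 4.0 / (1.0 + x * x)
    }
    printf "%.6f", h * sum
}')
echo "pi is approximately $pi_est"
```

A cluster run with the same n should agree with this value to within floating-point rounding, whatever the number of processes.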
4. Troubleshooting
mpdcheck -pc
mpdcheck -l
mpdcheck -f mpd.hosts
On a node:
mpdcheck -s
--------------------------------------------------------
Install Torque. Torque is a job scheduler for clusters.
wget http://www.clusterresources.com/ ... torque-2.4.8.tar.gz
1. On all nodes:
vi /etc/hosts
192.168.0.110 station110.example.com
192.168.0.101 station101.example.com
192.168.0.201 station201.example.com
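A missing entry in /etc/hosts on any node tends to show up later as an obscure connection failure, so it can help to check the file up front. A minimal sketch, assuming the three host names above. check_hosts is a hypothetical helper, and the demo runs against a scratch file with one entry deliberately left out.

```shell
#!/bin/bash
# check_hosts FILE NAME...: report every NAME that has no entry in FILE.
# Returns 0 when all names are present, 1 otherwise.
check_hosts() {
    file=$1
    shift
    status=0
    for name in "$@"; do
        # -F: match the name literally, -w: match it as a whole word
        if ! grep -qwF -- "$name" "$file"; then
            echo "missing: $name"
            status=1
        fi
    done
    return $status
}

# Demo: station201 is intentionally absent from the scratch file.
hosts_demo=$(mktemp)
printf '%s\n' \
    "192.168.0.110 station110.example.com" \
    "192.168.0.101 station101.example.com" > "$hosts_demo"
check_hosts "$hosts_demo" station110.example.com \
    station101.example.com station201.example.com \
    || echo "fix the hosts file before continuing"
```

On a real node, pass /etc/hosts as the first argument.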
2. On the server node:
vi /etc/hosts.equiv
station101.example.com
station201.example.com
3. On the compute nodes:
vi /etc/hosts.equiv
station110.example.com
4. Install rsh-server on all nodes so they can communicate; SSH keys can be used instead and are more secure
yum install rsh-server -y
chkconfig rsh on
chkconfig rlogin on
chkconfig rexec on
/etc/init.d/xinetd restart
5. On the server node:
tar xf torque-2.4.8.tar.gz
cd torque-2.4.8
./configure --with-rcp=rcp --with-default-server=station110.example.com
(with SSH keys, use --with-rcp=scp)
make
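make install # Torque's configuration directory is /var/spool/torque
make packages # builds the packages for installing the compute nodes; use them when the compute nodes have the same architecture as the server node
cp contrib/init.d/pbs_server /etc/init.d/ # needed on the server node
scp contrib/init.d/pbs_mom station101:/etc/init.d/ # needed on the compute nodes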
scp contrib/init.d/pbs_mom station201:/etc/init.d/
scp torque-package-clients-linux-i686.sh torque-package-mom-linux-i686.sh station101:/root
scp torque-package-clients-linux-i686.sh torque-package-mom-linux-i686.sh station201:/root
./torque.setup root # make root the Torque administrator account
vi /var/spool/torque/server_priv/nodes # define the compute nodes
station101.example.com
station201.example.com
qterm -t quick # stop Torque
/etc/init.d/pbs_server start # start Torque
6. Install Maui. Maui is a job scheduler that is more capable than the one bundled with Torque
tar xf maui-3.3.tar.gz
cd maui-3.3
./configure --with-pbs
make
make install
cp contrib/service-scripts/redhat.maui.d /etc/init.d/maui
chmod +x /etc/init.d/maui
vi /etc/init.d/maui
MAUI_PREFIX=/usr/local/maui
daemon $MAUI_PREFIX/sbin/maui
/etc/init.d/maui start
7. Install Torque on the compute nodes
./torque-package-mom-linux-i686.sh --install
./torque-package-clients-linux-i686.sh --install
If the nodes have different architectures, build from source on each node whose architecture differs:
tar xf torque-2.4.8.tar.gz
cd torque-2.4.8
./configure --with-rcp=rcp --with-default-server=station110.example.com
make
make install_mom install_clients
Apply the following configuration on all compute nodes:
vi /var/spool/torque/mom_priv/config
$pbsserver station110.example.com
$logevent 255
/etc/init.d/pbs_mom start
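When the mom configuration is written from a script instead of vi, the leading $ in $pbsserver and $logevent must reach the file literally. A minimal sketch using a quoted here-document, which disables shell expansion. write_mom_config is a hypothetical helper, and the demo writes to a scratch file rather than /var/spool/torque/mom_priv/config.

```shell
#!/bin/bash
# write_mom_config FILE: hypothetical helper that writes the two settings above.
write_mom_config() {
    # 'EOF' is quoted so that $pbsserver and $logevent are not expanded.
    cat > "$1" <<'EOF'
$pbsserver station110.example.com
$logevent 255
EOF
}

# Demo on a scratch file; print it to confirm the $ signs survived.
mom_demo=$(mktemp)
write_mom_config "$mom_demo"
cat "$mom_demo"
```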
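Testing:
Torque does not allow jobs to be run as root. Create an account with the same user name and UID on every node, and copy the pi test program icpi to that user's home directory on each node.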
su - gao
vi job1.pbs # serial job
-----------------------------------------------
#!/bin/bash
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
cd /home/gao
echo running on hosts `hostname`
echo time is `date`
echo directory is $PWD
echo this job runs on the following nodes:
cat $PBS_NODEFILE
echo this job has allocated 1 node
./prog
-------------------------------------------------
vi job2.pbs # parallel job
-----------------------------------------------
#!/bin/bash
#PBS -N job_name
#PBS -o job.log
#PBS -e job.err
#PBS -q batch
#PBS -l nodes=2
cd /home/gao
echo running on hosts `hostname`
echo time is `date`
echo directory is $PWD
echo job runs on the nodes:
cat $PBS_NODEFILE
NPROCS=`wc -l < $PBS_NODEFILE`
echo this job has allocated $NPROCS nodes
mpiexec -machinefile $PBS_NODEFILE -np $NPROCS ./prog
-------------------------------------------------------------------------
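In job2.pbs, $PBS_NODEFILE names a file with one line per allocated processor slot, so a host can appear more than once and NPROCS counts slots rather than distinct machines. A minimal sketch with a made-up node file (outside a job PBS_NODEFILE is not set, so a scratch file stands in for it):

```shell
#!/bin/bash
# Stand-in for $PBS_NODEFILE as it might look for a request such as
# nodes=2:ppn=2 (hypothetical contents).
nodefile=$(mktemp)
printf '%s\n' station101 station101 station201 station201 > "$nodefile"

# Processor slots: one line each, the same computation as in job2.pbs.
NPROCS=$(wc -l < "$nodefile" | tr -d ' ')
# Distinct machines: collapse the repeated host names.
NNODES=$(sort -u "$nodefile" | wc -l | tr -d ' ')

echo "slots: $NPROCS, distinct hosts: $NNODES"
```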
vi prog
-------------------
#!/bin/bash
echo 1000000|./icpi
Job scheduling:
qsub job1.pbs # submit the job
qstat # show job status
pbsnodes # show node status
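These commands can be combined to submit a job and wait for it. A minimal sketch, assuming qsub prints the job identifier on standard output and qstat exits with a non-zero status once the server no longer knows the job. submit_and_wait is a hypothetical helper. If the server keeps completed jobs listed for a while, the loop ends only after they are purged.

```shell
#!/bin/bash
# submit_and_wait SCRIPT [INTERVAL]: hypothetical helper that submits a
# job script and polls every INTERVAL seconds (default 5) until qstat
# no longer reports the job.
submit_and_wait() {
    script=$1
    interval=${2:-5}
    jobid=$(qsub "$script") || return 1
    echo "submitted $jobid"
    while qstat "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "finished $jobid"
}
```

Usage, as the unprivileged user: submit_and_wait job1.pbs, then inspect job.log and job.err.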