How to install the open-source edition of PBS Pro on a Rocks HPC cluster


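PBS Pro can be regarded as the commercial counterpart of TORQUE. It is feature-rich, and since it was open-sourced it has arguably become the most capable free job scheduler available.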

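However, the prebuilt packages of the open-source PBS Pro target CentOS 7, while the Rocks cluster management software only supports up to CentOS 6.8. Using open-source PBS Pro on Rocks therefore means building it from source, and that process has quite a few pitfalls. I am recording the steps here for reference.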




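First, when installing the Rocks cluster, prefer version 6.1.1 over 6.2, and do not install the SGE roll. For the OS roll, it is better to use a standard CentOS 6.7 or 6.8 installation disc than the one bundled with Rocks.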


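Once the cluster is installed:


Edit /etc/hosts by hand so that the head node's public FQDN resolves to its private IP.

For example, change

42.58.6.9  headnode.test.com

to

10.0.0.1  headnode.test.com


Remember to redo this edit every time you run rocks sync host network; otherwise the nodes cannot connect to the PBS server.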
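Because the edit has to be repeated after every sync, it may be worth scripting. The following is only a sketch: fix_hosts_entry is a hypothetical helper name, and it assumes the FQDN appears as a whole whitespace-separated field on its hosts line. It rewrites the address field of any line that lists the given FQDN and leaves comments and other lines alone.

```shell
#!/bin/sh
# Hypothetical helper: point an FQDN at the private IP in a hosts file.
# Usage: fix_hosts_entry FILE FQDN PRIVATE_IP
fix_hosts_entry() {
    file=$1 fqdn=$2 ip=$3
    tmp=$(mktemp) || return 1
    awk -v fqdn="$fqdn" -v ip="$ip" '
        # Leave comments and blank lines untouched
        /^[[:space:]]*#/ || NF == 0 { print; next }
        {
            # Replace the address when any name field equals the FQDN
            for (i = 2; i <= NF; i++) {
                if ($i == fqdn) { $1 = ip; break }
            }
            print
        }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example (needs root, so it is left commented out):
# rocks sync host network && fix_hosts_entry /etc/hosts headnode.test.com 10.0.0.1
```

Writing the result back with cat rather than mv keeps the ownership and permissions of the original file.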



Download these 4 packages:

pbspro-14.1.0.tar.gz

autoconf-2.69-12.2.noarch.rpm 

automake-1.13.4-3.2.noarch.rpm

libedit-devel-2.11-4.20080712cvs.1.el6.x86_64.rpm 


Then put them in a directory shared across the cluster. This article uses /share/data/install as the example.
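Before going further, a quick check that all four files actually landed in the shared directory can save a failed build later. This is a minimal sketch; missing_packages is a hypothetical helper name, and the file names are the ones used in the commands of this article.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every required file that is
# missing from the given directory (prints nothing when all are present).
missing_packages() {
    dir=$1
    for f in pbspro-14.1.0.tar.gz \
             autoconf-2.69-12.2.noarch.rpm \
             automake-1.13.4-3.2.noarch.rpm \
             libedit-devel-2.11-4.20080712cvs.1.el6.x86_64.rpm; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

# Example:
# missing_packages /share/data/install
```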

 

Upgrade the three RPM packages:

rpm -Uhv /share/data/install/autoconf-2.69-12.2.noarch.rpm 

rpm -Uhv /share/data/install/automake-1.13.4-3.2.noarch.rpm 

rpm -Uhv /share/data/install/libedit-devel-2.11-4.20080712cvs.1.el6.x86_64.rpm 


Install the required packages:

yum --enablerepo=base  install -y gcc make rpm-build libtool hwloc-devel libX11-devel libXt-devel libedit-devel libical-devel ncurses-devel perl postgresql-devel python-devel tcl-devel tk-devel swig expat-devel openssl-devel libXext libXft  expat libedit postgresql-server python sendmail sudo tcl tk libical glibc

yum  --enablerepo=epel   install  hwloc hwloc-devel



cd /share/data/install/

tar -xvf pbspro-14.1.0.tar.gz

cd pbspro-14.1.0 

./autogen.sh

./configure --prefix=/opt/pbs

make

make install
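Typed one by one, the build commands above keep going even if an earlier step failed. As a sketch only, the same sequence can be wrapped so that it stops at the first failure. The names run, build_pbs and DRY_RUN are hypothetical; with DRY_RUN=1 the commands are printed instead of executed, which is handy for checking the script before a long build.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_pbs SRC_DIR PREFIX: same steps as above, chained with && so the
# sequence stops at the first failing step. The subshell keeps the cd
# from changing the caller's working directory.
build_pbs() {
    src=$1 prefix=$2
    (
        run cd "$src" &&
        run ./autogen.sh &&
        run ./configure --prefix="$prefix" &&
        run make &&
        run make install
    )
}

# Example:
# build_pbs /share/data/install/pbspro-14.1.0 /opt/pbs
```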


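The installation is complete; now initialize it. This assumes the head node does not run compute jobs. Replace kunanyi-admin.local with the private hostname of your own head node.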

/opt/pbs/libexec/pbs_postinstall

chmod 4755 /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp

echo "PBS_SERVER=kunanyi-admin.local" > /etc/pbs.conf

echo "PBS_START_SERVER=1" >> /etc/pbs.conf

echo "PBS_START_SCHED=1" >> /etc/pbs.conf

echo "PBS_START_COMM=1" >> /etc/pbs.conf

echo "PBS_START_MOM=0" >> /etc/pbs.conf

echo "PBS_EXEC=/opt/pbs" >> /etc/pbs.conf

echo "PBS_HOME=/var/spool/pbs" >> /etc/pbs.conf

echo "PBS_CORE_LIMIT=unlimited" >> /etc/pbs.conf

echo "PBS_SCP=/usr/bin/scp" >> /etc/pbs.conf

/etc/init.d/pbs start

. /etc/profile.d/pbs.sh


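The head node is now fully installed.

Run the following commands on each compute node. The key idea is that each compute node runs make install from the source tree already built on the head node, which gives it a full PBS Pro installation. These commands can also be placed in extend-compute.xml.

rpm -ivh /share/data/install/libedit-devel-2.11-4.20080712cvs.1.el6.x86_64.rpm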

yum --enablerepo=base  install -y gcc make rpm-build libtool hwloc-devel libX11-devel libXt-devel libedit-devel libical-devel ncurses-devel perl postgresql-devel python-devel tcl-devel tk-devel swig expat-devel openssl-devel libXext libXft  expat libedit postgresql-server python sendmail sudo tcl tk libical

cd /share/data/install/pbspro-14.1.0/

make install

/opt/pbs/libexec/pbs_postinstall

chmod 4755 /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp



echo "PBS_SERVER=kunanyi-admin.local" > /etc/pbs.conf

echo "PBS_START_SERVER=0" >> /etc/pbs.conf

echo "PBS_START_SCHED=0" >> /etc/pbs.conf

echo "PBS_START_COMM=0" >> /etc/pbs.conf

echo "PBS_START_MOM=1" >> /etc/pbs.conf

echo "PBS_EXEC=/opt/pbs" >> /etc/pbs.conf

echo "PBS_HOME=/var/spool/pbs" >> /etc/pbs.conf

echo "PBS_CORE_LIMIT=unlimited" >> /etc/pbs.conf

echo "PBS_SCP=/usr/bin/scp" >> /etc/pbs.conf

. /etc/profile.d/pbs.sh

/etc/init.d/pbs start
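The pbs.conf files on the head node and on the compute nodes differ only in the four PBS_START_* flags. As a minimal sketch, one function can generate either variant from a here-document instead of a series of echo lines. write_pbs_conf and the role names head and compute are hypothetical; the values themselves are the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper: write a pbs.conf for the given role.
# Usage: write_pbs_conf FILE SERVER_HOST head|compute
write_pbs_conf() {
    conf=$1 server=$2 role=$3
    case $role in
        head)    daemons=1 mom=0 ;;   # server, scheduler, comm on; MoM off
        compute) daemons=0 mom=1 ;;   # only the MoM runs on compute nodes
        *) echo "unknown role: $role" >&2; return 1 ;;
    esac
    cat > "$conf" <<EOF
PBS_SERVER=$server
PBS_START_SERVER=$daemons
PBS_START_SCHED=$daemons
PBS_START_COMM=$daemons
PBS_START_MOM=$mom
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
EOF
}

# Example (needs root, so it is left commented out):
# write_pbs_conf /etc/pbs.conf kunanyi-admin.local compute
```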



Once PBS is installed on all nodes, register the compute nodes on the head node. Here hc-002 is used as the example.


qmgr -c "create node hc-002"
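With more than a handful of nodes, typing one qmgr -c per node gets tedious. Since qmgr also reads directives from standard input, a sketch like the following can generate them in bulk. node_directives is a hypothetical helper, and the host names other than hc-002 are made up for illustration.

```shell
#!/bin/sh
# Print one "create node" directive per hostname given as an argument.
node_directives() {
    for n in "$@"; do
        printf 'create node %s\n' "$n"
    done
}

# Example (needs a running PBS server, so it is left commented out):
# node_directives hc-001 hc-002 hc-003 | qmgr
```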


You can then check the compute nodes with the following command:

pbsnodes -a 
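On a larger cluster the full listing is long, and usually only the unhealthy nodes matter. The sketch below filters the listing down to those. It assumes the usual layout of the output, where each node name starts in the first column and its attributes follow on indented "name = value" lines; nodes_down is a hypothetical helper name.

```shell
#!/bin/sh
# Read `pbsnodes -a` output on stdin and print "node state" for every
# node whose state mentions down, offline or unknown.
nodes_down() {
    awk '
        /^[^[:space:]]/ { node = $1 }   # unindented line = node name
        $1 == "state" && $2 == "=" && $3 ~ /down|offline|unknown/ { print node, $3 }
    '
}

# Example (needs a running PBS server, so it is left commented out):
# pbsnodes -a | nodes_down
```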


After that, configure PBS. For example:

qmgr

create queue workq

set queue workq queue_type = Execution

set queue workq enabled = True

set queue workq started = True

set server scheduling = True

set server default_queue = workq

set server log_events = 511

set server mail_from = adm

set server query_other_jobs = True

set server resources_default.ncpus = 1

set server scheduler_iteration = 600

set server resv_enable = True

set server node_fail_requeue = 310

set server max_array_size = 10000

set server default_chunk.ncpus=1



set server acl_host_enable = True

set server acl_hosts = kunanyi-admin

set server flatuid = True



set server acl_users ="+test@kunanyi-admin.hpc.tpac.org.au,+test"

set queue workq acl_users ="+test@kunanyi-admin.hpc.tpac.org.au,+test"
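Instead of typing these directives into an interactive qmgr session, they can be kept in a text file and fed to qmgr on standard input, which makes the setup repeatable after a reinstall. The following is only a sketch: qmgr_directives is a hypothetical helper, and the file path in the example is made up. It drops blank lines and full-line # comments so that the file can be annotated freely.

```shell
#!/bin/sh
# Print the directives from a file, without blank lines and without
# lines whose first non-blank character is '#'.
qmgr_directives() {
    sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d' "$1"
}

# Example (needs a running PBS server, so it is left commented out):
# qmgr_directives /share/data/install/workq.qmgr | qmgr
```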