Implementation Plan for Deploying an HPC Cluster

JiNan.YouQuan.Soft

Last modified 2024-01-16 23:30:03


First published 2020-12-25 21:59:59

JiNan YouQuan Software Co. Ltd.

Article link: https://blog.csdn.net/qq_26221775/article/details/111708789


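0. Preface

This tutorial (in practice, an operational walkthrough of a cluster deployment) uses the deployment of a small HPC cluster to explain the principles of Beowulf-architecture HPC clusters and the mainstream tools involved. It does not analyze the features or usage of each tool in depth.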

1. System Configuration

1.1 Network Topology

Server	Intranet IP	Compute network IP	Hostname	Notes
Login node	172.17.22.16		loginserver-chaosuan
Management node	172.17.22.13
Compute node 1	172.17.29.11	192.168.1.11	compute11	Master node
Compute node 2	172.17.29.12	192.168.1.12	compute12
Compute node 3	172.17.29.13	192.168.1.13	compute13
Compute node 4	172.17.29.14	192.168.1.14	compute14

1.2 Operating System

Login node: CentOS Linux release 7.3.1611

Management node: CentOS Linux release 7.3.1611

Compute nodes: CentOS Linux release 7.9.2009

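2. Compute Node and Login Node Configuration

2.1 Hostname Setup

Run the following commands on the login node and on every compute node to set the node's own hostname and the name-to-IP mappings for the other nodes:

vi /etc/hostname
vi /etc/hosts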
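
The /etc/hosts entries can be derived from the topology table in section 1.1. The following is a minimal sketch, not the author's exact file: the function name hosts_fragment is hypothetical, and it assumes that compute nodes reach each other over the compute network (192.168.1.x) while the login node, which has no compute-network address in the table, uses the intranet addresses (172.17.29.x).

```shell
# Sketch: print /etc/hosts entries built from the topology table in section 1.1.
# Assumption: "compute" view uses 192.168.1.x, "login" view uses 172.17.29.x.
hosts_fragment() {
    local prefix n
    case "$1" in
        compute) prefix=192.168.1 ;;
        login)   prefix=172.17.29 ;;
        *) echo "usage: hosts_fragment compute|login" >&2; return 1 ;;
    esac
    echo "172.17.22.16 loginserver-chaosuan"
    for n in 11 12 13 14; do
        echo "${prefix}.${n} compute${n}"
    done
}

# Print the fragment for review. Once it looks right, append it as root:
#   hosts_fragment compute >> /etc/hosts
hosts_fragment compute
```

Printing first and appending second keeps a typo from landing in /etc/hosts on every node at once.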

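2.2 Passwordless SSH Login

Taking loginserver-chaosuan as the example, run the following commands on the login node and on every compute node to set up passwordless login.

a) Generate the public/private key pair

ssh-keygen -t rsa

b) Copy the public key to the other nodes

ssh-copy-id compute11
ssh-copy-id compute12
ssh-copy-id compute13
ssh-copy-id compute14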
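
With more than a handful of nodes, the per-node ssh-copy-id calls are easier to manage as a loop. A minimal sketch follows; the NODES list, the DRY_RUN switch, and the function name are illustrative conventions introduced here, not part of the original procedure. By default it only prints the commands it would run.

```shell
# Sketch: run ssh-copy-id against every node in a list.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of running them.
NODES="compute11 compute12 compute13 compute14"
DRY_RUN="${DRY_RUN:-1}"

copy_key_to_nodes() {
    local node
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh-copy-id $node"
        else
            # Prompts once for the password of the remote account on each node.
            ssh-copy-id "$node"
        fi
    done
}

copy_key_to_nodes
```

Set DRY_RUN=0 before calling copy_key_to_nodes to actually copy the key.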

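2.3 Disabling the Firewall

Perform the following steps on the login node and the compute nodes.

a) Check the firewall status

systemctl status firewalld.service

b) Stop the running firewall

systemctl stop firewalld.service

c) Keep the firewall from starting at boot

systemctl disable firewalld.service

d) Disable SELinux

vi /etc/selinux/config

Change SELINUX=enforcing to SELINUX=disabled, then run setenforce 0 to switch to permissive mode immediately. The disabled setting itself takes effect at the next reboot, so you can also simply restart the machine.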
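
The manual edit of /etc/selinux/config can also be scripted. The following is a rough sketch under the assumption that the file contains a single SELINUX= line; the function name disable_selinux_in is hypothetical, and it takes the file path as an argument so it can be tried on a copy first.

```shell
# Sketch: set SELINUX=disabled in the given config file.
# The pattern ^SELINUX= does not match the SELINUXTYPE= line.
disable_selinux_in() {
    # -i.bak edits in place and keeps the original as FILE.bak.
    sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}

# Demo on a scratch file. On a real node, run as root:
#   disable_selinux_in /etc/selinux/config && setenforce 0
demo=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$demo"
disable_selinux_in "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```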

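3. NTP Service

compute11 acts as the NTP server; loginserver-chaosuan, compute12, compute13, and compute14 are NTP clients.

3.1 NTP Server

a) Install NTP

yum install -y ntp

b) Edit the NTP configuration file

vi /etc/ntp.conf
# ntp.conf ships with default time servers; comment them out and add time servers for our own region
server ntp1.aliyun.com
server ntp2.aliyun.com
server ntp3.aliyun.com

# Fallback: when none of the three servers above is reachable, the local clock becomes the cluster's common time source.
# 127.127.1.0 is the address of ntpd's local-clock reference driver.
server 127.127.1.0
fudge 127.127.1.0 stratum 10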

c) Synchronize the server's clock once

ntpdate -u ntp2.aliyun.com

d) Enable at boot

systemctl enable ntpd

e) Start NTP

systemctl start ntpd
systemctl status ntpd

f) Check NTP

ntpstat
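
Step b) asks for the stock server lines in ntp.conf to be commented out before the new ones are added. A minimal sketch of that edit follows; set_ntp_servers is a hypothetical helper, and it assumes the stock entries start with the keyword server at the beginning of a line. It is not idempotent: running it twice appends the new servers twice.

```shell
# Sketch: comment out active "server" lines in an ntp.conf and append new ones.
# Usage: set_ntp_servers FILE SERVER [SERVER...]
set_ntp_servers() {
    local conf="$1" s
    shift
    # Prefix each active "server ..." line with "#"; the original is kept as FILE.bak.
    sed -i.bak -E 's/^server[[:space:]]/#&/' "$conf"
    for s in "$@"; do
        echo "server $s" >> "$conf"
    done
}

# Demo on a scratch file. On compute11, the real target would be /etc/ntp.conf.
demo=$(mktemp)
printf 'driftfile /var/lib/ntp/drift\nserver 0.centos.pool.ntp.org iburst\n' > "$demo"
set_ntp_servers "$demo" ntp1.aliyun.com ntp2.aliyun.com ntp3.aliyun.com
cat "$demo"
rm -f "$demo" "$demo.bak"
```

The same helper covers the clients in section 3.2 when called with 172.17.29.11 as the only server.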

3.2 NTP Clients

a) Install NTP

yum install -y ntp

b) Edit the NTP configuration file

vi /etc/ntp.conf

server 172.17.29.11

c) Enable at boot

systemctl enable ntpd

d) Start NTP

systemctl start ntpd
systemctl status ntpd

e) Check NTP

ntpstat

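4. Setting Up the NFS Service

NFS is often criticized for its performance in HPC settings, but it is simple to deploy, so it is used as the example here. Where performance matters, consider a parallel file system such as GPFS or Lustre.

compute11 acts as the NFS server; loginserver-chaosuan, compute12, compute13, and compute14 are NFS clients.

4.1 NFS Server

a) Install the RPC and NFS packages

yum -y install rpcbind nfs-utils

b) Start the services and enable them at boot

systemctl start rpcbind    # start the RPC service first
systemctl enable rpcbind   # enable at boot
systemctl start nfs-server nfs-secure-server # start the NFS service and the NFS secure transport service
systemctl enable nfs-server nfs-secure-server
firewall-cmd --permanent --add-service=nfs # allow NFS through the firewall; needed only if firewalld is still running (section 2.3 disables it)
firewall-cmd  --reload

c) Configure the shared directory by editing the exports file

chmod go+w /home # grant write permission to group and other
vi /etc/exports

 /home 172.17.22.16(rw,async,no_root_squash)
 /home 192.168.1.0/24(rw,async,no_root_squash)

systemctl reload nfs # reload the NFS service so the exports take effect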

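4.2 NFS Clients (Compute Nodes)

a) Install the RPC and NFS packages

yum -y install rpcbind nfs-utils

b) List the directories exported by the server

showmount -e 192.168.1.11

c) Add the mount to /etc/fstab so that it is mounted automatically at every boot

vi /etc/fstab

192.168.1.11:/home  /home       nfs    defaults 0 0

mount -a   # apply /etc/fstab

d) Verify

df -Th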
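
When the fstab entry is added by a script rather than by hand, it helps to guard against adding it twice. The following is a minimal sketch; ensure_line is a hypothetical helper that appends a line only if an identical line is not already in the file.

```shell
# Sketch: append LINE to FILE unless an identical line already exists.
# Usage: ensure_line FILE LINE
ensure_line() {
    # -x matches whole lines, -F treats LINE as a fixed string, -q suppresses output.
    grep -qxF -- "$2" "$1" 2>/dev/null || echo "$2" >> "$1"
}

# Demo on a scratch file. On a compute node, run as root against /etc/fstab
# and follow with "mount -a".
demo=$(mktemp)
ensure_line "$demo" "192.168.1.11:/home  /home  nfs  defaults 0 0"
ensure_line "$demo" "192.168.1.11:/home  /home  nfs  defaults 0 0"
cat "$demo"
rm -f "$demo"
```

The second call is a no-op, so the script can be rerun safely during node reprovisioning.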

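4.3 NFS Client (Login Node)

a) Install the RPC and NFS packages

yum -y install rpcbind nfs-utils

b) List the directories exported by the server

showmount -e 172.17.29.11

c) Add the mount to /etc/fstab so that it is mounted automatically at every boot

vim /etc/fstab

172.17.29.11:/home  /home       nfs    defaults 0 0

mount -a   # apply /etc/fstab

d) Verify

df -Th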

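5. Setting Up the NIS Service

compute11 acts as the NIS server; loginserver-chaosuan, compute12, compute13, and compute14 are NIS clients.

5.1 NIS Server

a) Install the packages

yum install rpcbind yp-tools ypbind ypserv

b) Set the NIS domain name

vi /etc/sysconfig/network

NISDOMAIN=hpc

nisdomainname hpc   # apply the domain name now, without waiting for a reboot

c) Specify the NIS server to query

vi /etc/yp.conf

ypserver compute11

d) Start NIS

# ypinit builds the maps for a running ypserv, so start ypserv first
systemctl enable ypserv.service
systemctl start ypserv.service
/usr/lib64/yp/ypinit -m

systemctl enable ypbind.service
systemctl restart ypbind.service
systemctl status ypbind.service

e) Verify

ypwhich # show the NIS server name
ypcat passwd # list the accounts published by the NIS server

f) Set the name-service lookup order to use the NIS database

vi /etc/nsswitch.conf

Change the relevant lines to the following values:

passwd: files nis
shadow: files nis
group: files nis
hosts: files nis dns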
g) Restart NIS

systemctl restart ypbind.service

h) Add accounts

useradd hycom
useradd lsfadmin
cd /var/yp
make   # rebuild the NIS maps so the new accounts are published
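
The nsswitch.conf change in step f) has to be repeated on every NIS client, so it is a natural candidate for scripting. A minimal sketch follows; enable_nis_in is a hypothetical helper, and it assumes the file already contains passwd, shadow, group, and hosts lines, which it rewrites to the values shown in step f).

```shell
# Sketch: rewrite the passwd/shadow/group/hosts lines of an nsswitch.conf for NIS.
# Usage: enable_nis_in FILE
enable_nis_in() {
    # One sed call so that FILE.bak holds the untouched original.
    sed -i.bak -E \
        -e 's/^(passwd|shadow|group):.*/\1: files nis/' \
        -e 's/^hosts:.*/hosts: files nis dns/' "$1"
}

# Demo on a scratch file. On a real node, run as root against /etc/nsswitch.conf.
demo=$(mktemp)
printf 'passwd:     files sss\nshadow:     files sss\ngroup:      files sss\nhosts:      files dns myhostname\n' > "$demo"
enable_nis_in "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```

Lines for other databases (for example netgroup) are left as they were.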

5.2 NIS Clients (Compute Nodes)

a) Install the packages

yum install rpcbind yp-tools ypbind

b) Set the NIS domain name

vi /etc/sysconfig/network

NISDOMAIN=hpc

c) Specify the NIS server to query (authconfig-tui can also be used to configure the client)

vi /etc/yp.conf

ypserver compute11

d) Start NIS

authconfig-tui

systemctl enable ypbind.service
systemctl restart ypbind.service
systemctl status ypbind.service

e) Verify

ypwhich
ypcat passwd

f) Set the name-service lookup order to use the NIS database

vi /etc/nsswitch.conf

Change the relevant lines to the following values:

passwd: files nis
shadow: files nis
group: files nis
hosts: files nis dns

g) Restart NIS

systemctl restart ypbind.service

5.3 NIS Client (Login Node)

The procedure is the same as in section 5.2.

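5.4 Passwordless SSH for NIS Accounts

If the home directory assigned to a NIS account at creation is on the shared directory, the following commands, run as that account, set up passwordless SSH for it across the nodes.

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # append, so existing keys are kept
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys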

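6. Compiler Configuration

6.1 Installing the Compiler

Install the Intel Parallel Studio XE 2019 Cluster Edition compilers: unpack the archive and run ./install.sh.

6.2 Configuring Environment Variables

On the login node and on every compute node, edit ~/.bash_profile:

vi ~/.bash_profile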

6.3 Writing the Node List File
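
Section 6.4 passes a node list to mpirun through -f /home/hycom/machinefile, but the file's contents are not shown in the original. The following is only a sketch of one plausible form: it assumes all four compute nodes take part and that plain hostnames, one per line, are what the launcher expects. The function name write_machinefile is hypothetical.

```shell
# Sketch: write a node list with one compute node hostname per line.
# Usage: write_machinefile FILE
write_machinefile() {
    local out="$1" n
    : > "$out"                      # create or truncate the file
    for n in 11 12 13 14; do
        echo "compute${n}" >> "$out"
    done
}

# Demo on a scratch file. On the cluster, the target used in section 6.4
# would be /home/hycom/machinefile.
demo=$(mktemp)
write_machinefile "$demo"
cat "$demo"
rm -f "$demo"
```

Because /home is shared over NFS, writing the file once makes it visible on every node.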

6.4 Running the Intel MPI Test Program

mpirun -np 20 -f /home/hycom/machinefile ./hello-mpi

7. TORQUE

A job scheduler such as LSF, TORQUE, or SLURM is normally installed on the management node. IBM offers a free Community Edition of LSF, but it limits the cluster size, so TORQUE or SLURM is the better choice when the budget is tight. This article uses TORQUE.

7.1 TORQUE Server

The node loginserver-chaosuan.novalocal is used as the TORQUE server:

a) Install dependencies

yum -y install libxml2-devel openssl-devel gcc gcc-c++ boost-devel libtool

b) Build and install TORQUE

tar -zxvf ./torque-4.1.2.tar.gz
cd ./torque-4.1.2
./configure --prefix=/opt/torque
make
make packages
make install # to uninstall, run make uninstall

c) Enable at boot

cp contrib/init.d/{pbs_{server,sched,mom},trqauthd} /etc/init.d
for i in pbs_server pbs_sched pbs_mom trqauthd; do chkconfig --add $i; chkconfig $i on; done

d) Set environment variables

vi /etc/profile
  TORQUE=/opt/torque
  export PATH=$PATH:$TORQUE/bin:$TORQUE/sbin

e) Make root the administrative account

./torque.setup root

f) Start the services

qterm -t quick
for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i start; done

g) On the scheduler node, list the compute nodes' hostnames and CPU core counts

vi /var/spool/torque/server_priv/nodes
  compute11 np=28
  compute12 np=28
  compute13 np=28
  compute14 np=28

h) Restart the services

for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i restart; done

7.2 TORQUE Clients (Compute Nodes)

compute11, compute12, compute13, and compute14 are TORQUE clients. Taking compute11 as the example, configure each of them as follows:

a) Copy the install packages from loginserver-chaosuan.novalocal to compute11

scp contrib/init.d/{pbs_mom,trqauthd} root@compute11:/etc/init.d/
scp torque-package-{mom,clients}-linux-x86_64.sh root@compute11:/root/

b) Install the packages

./torque-package-clients-linux-x86_64.sh --install
./torque-package-mom-linux-x86_64.sh --install

c) Configure pbs_mom

vi /var/spool/torque/mom_priv/config
  $pbsserver loginserver-chaosuan.novalocal
  $logevent 225

d) Start pbs_mom

for i in pbs_mom trqauthd; do service $i start; done

e) Set environment variables

vi /etc/profile
  TORQUE=/opt/torque
  export PATH=$PATH:$TORQUE/bin:$TORQUE/sbin

f) Check that the nodes are healthy

qnodes # or: pbsnodes -a
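
Once qnodes reports the nodes as free, a small test job confirms that scheduling works end to end. The following sketch generates a minimal TORQUE job script for the hello-mpi program from section 6.4. The job name, the resource request of 2 nodes with 28 cores each, the walltime, and the use of -machinefile with $PBS_NODEFILE are assumptions for illustration, and write_job_script is a hypothetical helper.

```shell
# Sketch: write a minimal TORQUE job script to the given path.
# Usage: write_job_script FILE
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#PBS -N hello-mpi
#PBS -l nodes=2:ppn=28
#PBS -l walltime=00:10:00
#PBS -j oe

# Jobs start in the home directory; go back to where qsub was run.
cd "$PBS_O_WORKDIR"

# PBS_NODEFILE holds one line per allocated core.
NP=$(wc -l < "$PBS_NODEFILE")
mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./hello-mpi
EOF
}

# Demo: write the script to a scratch file and check its syntax.
job=$(mktemp)
write_job_script "$job"
bash -n "$job" && echo "job script written to $job"
# Submit as a regular (non-root) user from the directory holding hello-mpi:
#   qsub "$job"
```

TORQUE refuses jobs submitted by root by default, which is why the submission is shown for a regular account such as hycom.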

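8. Storage Server

In practice, a dedicated storage server is usually configured as needed. Because this depends on the specific storage system, handle it according to your own situation.

9. Cluster Monitoring

Ganglia is a cross-platform, scalable distributed monitoring system for high-performance computing systems. It mainly tracks system performance, such as CPU, memory, and disk utilization, I/O load, and network traffic. Its graphs make the working state of each node easy to see, which helps when tuning and allocating system resources and improving overall performance. Many installation tutorials are already available online, so the details are not repeated here.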
