This post mainly records how to install Slurm: just the basic installation, not the Slurm REST API or Slurm-InfluxDB job accounting.
The latest Slurm release is already slurm-20.11.0-0rc2.tar.bz2, so if I don't finish this write-up soon it will turn into an outdated tutorial...
The installation and configuration of the Slurm REST API and Slurm-InfluxDB will be covered in No. 5-2 and No. 5-3.
1 Environment preparation
System environment: CentOS 7.6, kernel 3.10.0-1062.4.1.el7.x86_64. MySQL needs to be installed; see this link for reference.
If your cluster has no nameserver, add the IP-to-hostname mappings to /etc/hosts on every node yourself.
1 management node running slurmctld (192.168.0.211), hostname: cm-wsy-c16m32d200-1
2 compute nodes running slurmd (192.168.0.218, 192.168.0.128)
In this post slurmdbd is installed on the management node; any other node would work too.
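For the three nodes above, the /etc/hosts entries would look like this (the two compute-node hostnames here are placeholders; substitute your actual ones):

```
192.168.0.211  cm-wsy-c16m32d200-1
192.168.0.218  compute-node-1
192.168.0.128  compute-node-2
```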
2 Create the munge and slurm users
Run on every node:
$ export MUNGEUSER=991 && groupadd -g $MUNGEUSER munge
$ useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
$ export SLURMUSER=992 && groupadd -g $SLURMUSER slurm
$ useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
The uid and gid can be chosen to suit your environment, as long as they are identical across all nodes of the cluster.
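A quick way to check this is to run the same lookup on each node and compare the output; with the 991/992 values above, every node should report the same numbers:

```shell
# Check the munge and slurm accounts on this node; repeat on every node.
# With the values above, they should report uid/gid 991 and 992 respectively.
id munge
id slurm
```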
3 Install munge
Run on every node:
$ yum install epel-release openssh-clients -y
$ yum install munge munge-libs munge-devel -y
$ yum install rng-tools -y
$ rngd -r /dev/urandom
Next, generate munge.key on the management node (192.168.0.211) and push it to every other node with scp. On the management node run:
$ /usr/sbin/create-munge-key -r
$ dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
$ chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
$ scp /etc/munge/munge.key root@NODE:/etc/munge/
Once munge.key has been distributed, fix the munge directory ownership and permissions on every node, then start munge:
$ chown -R munge: /etc/munge/ /var/log/munge/ && chmod 0700 /etc/munge/ /var/log/munge/
$ systemctl enable munge
$ systemctl start munge
$ systemctl status munge
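Before moving on, it is worth verifying that munge credentials decode correctly, both locally and across nodes (replace NODE with a compute node's hostname):

```shell
# Encode and decode a credential locally:
munge -n | unmunge
# Decode on a remote node; this only succeeds if both sides share the same munge.key:
munge -n | ssh NODE unmunge
# Both commands should report "STATUS: Success (0)".
```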
4 Install Slurm's dependencies, build the Slurm source package, and install it
Install the dependencies on every node:
$ yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
$ yum install python3-pip perl-ExtUtils-MakeMaker gcc rpm-build mysql-devel json-c json-c-devel http-parser http-parser-devel -y
Once the dependencies are in place, cd into the directory containing the Slurm source tarball. I keep it under /usr/local; adjust the path to your own setup. On every node run:
$ rpmbuild -ta --with mysql slurm-20.02.3.tar.bz2
If the command fails, analyze the error yourself first (it is usually a dependency problem); you can also leave a comment and we can look at it together. When it finishes, it produces RPM files, which you then install.
Run the following on every node:
$ cd /root/rpmbuild/RPMS/x86_64
$ yum localinstall slurm-*.rpm -y
At this point slurm and slurmdbd are installed.
5 Basic Slurm configuration
5.1 Edit the configuration files
Slurm's configuration files live under /etc/slurm/.
First create the configuration files Slurm will use; on the management node run:
$ cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
$ cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
$ cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
Edit slurm.conf (the one below is the simplest I could make; its partition and node entries are fake, but they are enough for Slurm to start successfully). The management node 211 has hostname cm-wsy-c16m32d200-1:
SlurmctldHost=cm-wsy-c16m32d200-1
#
SlurmctldDebug=info
SlurmdDebug=debug3
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Sched
# TIMERS
InactiveLimit=0
KillWait=15
ResumeTimeout=600
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=12
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations
AccountingStorageHost=cm-wsy-c16m32d200-1
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=slurm20_cluster
JobCompHost=cm-wsy-c16m32d200-1
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=root
JobAcctGatherFrequency=1
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendTime=70
# COMPUTE NODES
NodeName=FakeNode
PartitionName=FakePartition State=INACTIVE
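The FakeNode/FakePartition lines exist only so that slurmctld starts; once you have real compute nodes, the entries would look roughly like this (the hostnames, CPU count, and memory below are hypothetical; take the real values from the output of `slurmd -C` on each compute node):

```
NodeName=compute-node-[1-2] CPUs=16 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=compute-node-[1-2] Default=YES MaxTime=INFINITE State=UP
```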
Edit slurmdbd.conf:
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
# slurmDBD info
DbdAddr=cm-wsy-c16m32d200-1
DbdHost=cm-wsy-c16m32d200-1
DbdPort=6819
SlurmUser=root
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
# Database info
StorageType=accounting_storage/mysql
StorageHost=cm-wsy-c16m32d200-1
StoragePort=3306
StoragePass=123456
StorageUser=root
StorageLoc=slurm_acct_db
The default cgroup.conf can be used as-is.
Once the edits are done, send these three files to every node with scp.
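slurmdbd also needs to reach the slurm_acct_db database with the credentials in slurmdbd.conf. With the root account used above, slurmdbd creates the database itself on first start; if you would rather use a dedicated account instead of root, the grant would look like this (the 'slurm' database user and its password below are assumptions, not part of the original setup):

```shell
mysql -u root -p <<'EOF'
CREATE DATABASE IF NOT EXISTS slurm_acct_db;
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '123456';
FLUSH PRIVILEGES;
EOF
```

If you go this way, change StorageUser and StoragePass in slurmdbd.conf to match.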
5.2 Fix Slurm file permissions and start Slurm
On the management node run:
$ mkdir /var/spool/slurmctld && chown slurm: /var/spool/slurmctld && chmod 755 /var/spool/slurmctld
$ mkdir /var/log/slurm && touch /var/log/slurm/slurmctld.log && chown slurm: /var/log/slurm/slurmctld.log
$ touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log && chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
On every compute node run:
$ mkdir /var/spool/slurmd && chown slurm: /var/spool/slurmd && chmod 755 /var/spool/slurmd
$ mkdir /var/log/slurm && touch /var/log/slurm/slurmd.log && chown slurm: /var/log/slurm/slurmd.log
Disable the firewall on every node:
$ systemctl stop firewalld
$ systemctl disable firewalld
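Alternatively, if you prefer to keep firewalld running, opening Slurm's ports should also work; the port numbers match the slurm.conf and slurmdbd.conf above:

```shell
# Allow slurmctld (6817), slurmd (6818) and slurmdbd (6819):
firewall-cmd --permanent --add-port=6817-6819/tcp
firewall-cmd --reload
```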
Start slurmdbd on the management node:
$ systemctl enable slurmdbd.service
$ systemctl start slurmdbd.service
$ systemctl status slurmdbd.service
Start slurmctld on the management node:
$ systemctl enable slurmctld.service
$ systemctl start slurmctld.service
$ systemctl status slurmctld.service
Start slurmd on the compute nodes:
$ systemctl enable slurmd.service
$ systemctl start slurmd.service
$ systemctl status slurmd.service
6 Slurm installation complete
At this point Slurm is fully installed.
If any service fails to start,
run it in the foreground in debug mode to find out why:
$ slurmctld -Dvvvvv
$ slurmdbd -Dvvvvv
$ slurmd -Dvvvvv
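Once the three services are up, a few quick commands confirm the pieces can talk to each other:

```shell
# The cluster should appear in the accounting database:
sacctmgr show cluster
# Partition and node state as seen by slurmctld:
sinfo
scontrol show nodes
# A trivial test job; note that with the fake node/partition above it will
# not actually run until real compute nodes are configured:
srun -N1 hostname
```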