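If you only want to learn Slurm on your own, a Docker deployment is quicker and simpler; see "[slurm] Part 2: Deploying a Slurm cluster and Jupyter with Docker, with support for submitting jobs through Slurm" on 空山人语-IT技术分享学习网站.
Without further ado, here is the tutorial.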
1. Node information
Three servers:
Control node: 10.2.88.100 controller
Compute node 1: 10.2.88.101 node1
Compute node 2: 10.2.88.102 node2
2. Edit hosts (all three nodes)
vim /etc/hosts
Add:
10.2.88.100 controller
10.2.88.101 node1
10.2.88.102 node2
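If you would rather script this step than edit the file by hand, the following is a minimal sketch. The helper name add_host is made up for illustration; it appends an entry only when the hostname is not already present, so it can be run repeatedly on all three nodes.

```shell
# Sketch: append a hosts entry only if the hostname is missing.
# add_host is an illustrative helper, not part of any package.
add_host() {
    local ip=$1 name=$2 file=${3:-/etc/hosts}
    # -w matches the hostname as a whole word, so "node1" does not match "node10".
    if ! grep -qw -- "$name" "$file"; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Usage (as root, on every node):
#   add_host 10.2.88.100 controller
#   add_host 10.2.88.101 node1
#   add_host 10.2.88.102 node2
```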
3. Passwordless SSH login (control node)
ssh-keygen
ssh-copy-id -i /root/.ssh/id_rsa root@node1
ssh-copy-id -i /root/.ssh/id_rsa root@node2
The last two commands first ask you to type yes and then the password.
4. Disable the firewall and SELinux (all three nodes)
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i "s/SELINUX=enforcing/SELINUX=disabled/" /etc/selinux/config
5. Add users (all three nodes)
useradd -u 1100 munge
useradd -u 1101 slurm
6. Install software (all three nodes)
Control node: yum install -y slurm slurm-perlapi slurm-slurmctld slurm-slurmdbd munge
Compute nodes: yum -y install slurm slurm-perlapi slurm-slurmd munge
On the control node, run:
create-munge-key
scp /etc/munge/munge.key node1:/etc/munge/munge.key
scp /etc/munge/munge.key node2:/etc/munge/munge.key
chown munge:munge /var/log/munge/
All three nodes:
chown munge:munge /var/log/munge/
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl enable munge && systemctl restart munge
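Before moving on, it is worth confirming that every node can decode a credential created on the control node, since a mismatched munge.key is a common cause of authentication failures. The sketch below assumes the passwordless SSH from step 3 is in place; check_munge is an illustrative name.

```shell
# Sketch: verify that a credential minted here is accepted locally and remotely.
# check_munge is an illustrative helper; pass the compute node names as arguments.
check_munge() {
    local node failed=0
    # Local round trip first: encode and decode on this host.
    if ! munge -n | unmunge >/dev/null 2>&1; then
        echo "local: FAILED"
        failed=1
    fi
    # Then encode locally and decode on each remote node.
    for node in "$@"; do
        if munge -n | ssh "$node" unmunge >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on the control node:
#   check_munge node1 node2
```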
7. Create the database
First prepare a MySQL database; if you do not have one, you can deploy it with Docker.
Create a database named slurm_acct_db with utf8 encoding.
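One possible way to create the database from the command line is sketched below. The helper slurm_db_sql is hypothetical and only prints the SQL statement; the host, port and user passed to the mysql client are placeholders for your own values.

```shell
# Sketch: print the SQL that creates the accounting database with utf8 encoding.
# slurm_db_sql is an illustrative helper; the database name defaults to slurm_acct_db.
slurm_db_sql() {
    local name=${1:-slurm_acct_db}
    printf 'CREATE DATABASE IF NOT EXISTS %s DEFAULT CHARACTER SET utf8;\n' "$name"
}

# Usage (placeholders in angle brackets are yours to fill in):
#   slurm_db_sql | mysql -h <db_host> -P <db_port> -u <user> -p
```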
8. Configuration files (control node)
vim slurm.conf
ControlMachine=controller
ControlAddr=10.2.88.100
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=root
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=node[1-2] CPUs=24 RealMemory=64299 CoresPerSocket=12 SocketsPerBoard=2 State=UNKNOWN
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
# The CPUs, RealMemory, CoresPerSocket and SocketsPerBoard values can be obtained with lscpu and free -m
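The hardware values in the NodeName line can be derived from lscpu and free -m rather than typed by hand. The following is a rough sketch; node_line is an illustrative helper that reads saved command output from two files, and it assumes the field labels printed by lscpu and free in an English locale.

```shell
# Sketch: build a NodeName line from saved `lscpu` and `free -m` output.
# node_line is an illustrative helper: node_line <name> <lscpu_file> <free_file>
node_line() {
    local name=$1 lscpu_out=$2 free_out=$3
    local cpus cores sockets mem
    # Each lscpu field is "Label:   value"; strip the padding from the value.
    cpus=$(awk -F: '/^ *CPU\(s\):/ {gsub(/[ \t]/, "", $2); print $2}' "$lscpu_out")
    cores=$(awk -F: '/^ *Core\(s\) per socket:/ {gsub(/[ \t]/, "", $2); print $2}' "$lscpu_out")
    sockets=$(awk -F: '/^ *Socket\(s\):/ {gsub(/[ \t]/, "", $2); print $2}' "$lscpu_out")
    # The second column of the "Mem:" row is total memory in MB.
    mem=$(awk '/^Mem:/ {print $2}' "$free_out")
    printf 'NodeName=%s CPUs=%s RealMemory=%s CoresPerSocket=%s SocketsPerBoard=%s State=UNKNOWN\n' \
        "$name" "$cpus" "$mem" "$cores" "$sockets"
}

# Usage on a compute node:
#   lscpu > /tmp/lscpu.txt; free -m > /tmp/free.txt
#   node_line node1 /tmp/lscpu.txt /tmp/free.txt
```

Running slurmd -C on a compute node prints a comparable line as detected by Slurm itself, which is a useful cross-check.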
vim cgroup.conf
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no
vim slurmdbd.conf
AuthType=auth/munge
DebugLevel=4
DbdAddr=controller
DbdHost=controller
PidFile=/var/run/slurm/slurmdbd.pid
PurgeEventAfter=1month
PurgeJobAfter=1month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=1month
PurgeUsageAfter=1month
SlurmUser=root
StorageType=accounting_storage/mysql
StorageHost=<database server IP>
StoragePort=<database port>
StoragePass=<password>
StorageUser=<username>
StorageLoc=slurm_acct_db
scp slurm.conf node1:/etc/slurm/slurm.conf
scp slurm.conf node2:/etc/slurm/slurm.conf
scp cgroup.conf node1:/etc/slurm/cgroup.conf
scp cgroup.conf node2:/etc/slurm/cgroup.conf
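With more than a couple of nodes, the copy step is easier to get right as a loop. This is a sketch only: push_conf and the DRY_RUN switch are made-up names, and it assumes you run it from the directory that holds slurm.conf and cgroup.conf.

```shell
# Sketch: copy both config files to every compute node given as an argument.
# Set DRY_RUN=1 to print the scp commands instead of running them.
push_conf() {
    local node file
    for node in "$@"; do
        for file in slurm.conf cgroup.conf; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "scp $file $node:/etc/slurm/$file"
            else
                scp "$file" "$node:/etc/slurm/$file" || return 1
            fi
        done
    done
}

# Usage on the control node:
#   DRY_RUN=1 push_conf node1 node2   # preview
#   push_conf node1 node2             # copy for real
```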
9. Start the services (all three nodes)
Control node:
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl enable slurmdbd slurmctld
Compute nodes:
systemctl restart slurmd
systemctl enable slurmd
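10. Verification
Run this on any of the three nodes:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 2 idle controller,gpu
A STATE of idle means the deployment is complete. (NODELIST shows the node names of your own cluster.)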
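To check the node states from a script instead of by eye, something along these lines could work. The helper not_idle is illustrative; it expects one "name state" pair per line, which is what sinfo -h -N -o "%N %T" prints.

```shell
# Sketch: list every node whose state is not "idle" and fail if any is found.
# Reads "name state" pairs on stdin, e.g. from: sinfo -h -N -o "%N %T"
not_idle() {
    awk '$2 != "idle" {print $1 " is " $2; bad=1} END {exit bad}'
}

# Usage:
#   sinfo -h -N -o "%N %T" | not_idle && echo "all nodes idle"
```

Once all nodes are idle, running srun -N2 hostname is a quick end-to-end test that a job can actually reach both compute nodes.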
11. Install NFS
A shared folder used to store and submit jobs.
All three nodes:
yum install nfs-utils nfs-utils-lib
mkdir /home/slurm
Control node:
vim /etc/exports
/home/slurm *(rw,sync,no_root_squash)
systemctl enable nfs-server && systemctl restart nfs-server
Compute nodes:
mount -t nfs controller:/home/slurm /home/slurm
vim /etc/fstab
10.2.88.100:/home/slurm /home/slurm nfs defaults 0 0
12. Common commands
Submit a job: sbatch test.sh
Show job details: scontrol show job <job_id>
Show queue status and resource usage: squeue
Show node status and resource usage: sinfo
Show node details: scontrol show node <node_name>
Show job accounting details: sacct -j <job_id> --format=JobID,JobName,Partition,Account,AllocCPUS,MaxRSS,Elapsed,State
Show the cluster configuration: scontrol show config
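The test.sh passed to sbatch is not shown in this tutorial, so here is a minimal sketch of what such a batch script might look like. The job name and output path are assumptions; the debug partition and the /home/slurm share come from the steps above.

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=/home/slurm/%j.out
# %j expands to the job ID; /home/slurm is the NFS share from step 11.

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "unknown" elsewhere.
msg="job ${SLURM_JOB_ID:-unknown} on $(uname -n)"
echo "$msg"
```

After sbatch test.sh, the job's output should appear in /home/slurm/<job_id>.out on whichever node you look from, because the directory is shared.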
This article is taken from my own notes, "[slurm] Part 1: A detailed tutorial on deploying a three-node Slurm cluster on CentOS 7", published on 空山人语-IT技术分享学习网站.