Install munge
https://www.cnblogs.com/haibaraai0913/p/11016885.html
munge provides the authentication mechanism for communication between Slurm components; it must be installed and started on every node.
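The linked post walks through munge in detail; the usual steps look roughly like the sketch below, assuming an apt-based distribution and systemd (package names and key-generation commands differ between distributions):
#Install munge on every node
apt-get install -y munge libmunge-dev
#Create the shared key on the controller (skip if the package already generated one)
dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
#Copy /etc/munge/munge.key to the same path on every other node with the same owner and mode
#Start the service on every node and verify a credential round-trips
systemctl enable --now munge
munge -n | unmunge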
Build and install Slurm from source (all nodes)
#Switch to root
sudo su
#Download the source package
wget https://download.schedmd.com/slurm/slurm-19.05.0.tar.bz2
#Extract
tar -xaf slurm*tar.bz2
#Change into the source directory
cd slurm-19.05.0
#Configure, build, and install
./configure --enable-debug --prefix=/opt/slurm --sysconfdir=/opt/slurm/etc
make && make install
An error that may appear during the build: /usr/bin/env: "python": No such file or directory
Solution:
#Create a symlink so "python" resolves to python3
ln -s /usr/bin/python3 /usr/bin/python
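Once make && make install finishes, a quick sanity check is to ask the installed binaries for their version (paths follow the --prefix chosen above):
#Confirm the binaries were installed under the prefix
/opt/slurm/sbin/slurmctld -V
/opt/slurm/bin/srun --version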
Create the slurm user and change file ownership (all nodes)
#Create the user with a home directory and login shell
useradd slurm -m -s /bin/bash
#Set a password for the user
passwd slurm
#Create the required directories
mkdir /opt/slurm/log
mkdir /opt/slurm/spool
mkdir /opt/slurm/run
#Change ownership of the install tree
chown -R slurm:slurm /opt/slurm
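Jobs run under the submitting user's UID on every node, so it is worth confirming that the slurm account got the same UID/GID on all machines and that the new directories belong to it:
#Check UID/GID consistency and directory ownership on each node
id slurm
ls -ld /opt/slurm/log /opt/slurm/spool /opt/slurm/run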
Configure hostnames
Set the local hostname
#Temporarily set the hostname (the controller is manager; the compute nodes are node1 and node2)
hostname manager
#Permanently set the hostname
vim /etc/hostname
#Edit the hostname and save the file. The change takes effect after a reboot.
Edit /etc/hosts
#Open the hosts file
vim /etc/hosts
#Append the following lines and save the file
192.168.231.128 manager
192.168.231.129 node1
192.168.231.130 node2
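A quick check that the names resolve before moving on:
#Verify the entries resolve and the nodes are reachable
getent hosts manager node1 node2
ping -c 1 node1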
Configuration (controller node)
#Copy the etc directory shipped with the source package
cp -r /opt/package/slurm-19.05.0/etc/ /opt/slurm/etc/
#Change ownership
chown -R slurm:slurm /opt/slurm/etc
#Copy the example configuration file
cp /opt/slurm/etc/slurm.conf.example /opt/slurm/etc/slurm.conf
#Open the configuration file for editing
vim /opt/slurm/etc/slurm.conf
Configuration file:
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=linux                              #Cluster name
ControlMachine=manager                         #Controller node hostname
ControlAddr=192.168.231.128                    #Controller node LAN address
#BackupController=
#BackupAddr=
#
SlurmUser=slurm                                #Account the controller daemon runs as
#SlurmdUser=root
SlurmctldPort=6817                             #Default slurmctld port
SlurmdPort=6818                                #Default slurmd port
AuthType=auth/munge                            #Authentication between components, via munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/opt/slurm/spool/slurm/ctld  #Directory where controller state is saved
SlurmdSpoolDir=/opt/slurm/spool/slurm/d        #Compute node state directory
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/opt/slurm/run/slurmctld.pid  #Controller daemon PID file
SlurmdPidFile=/opt/slurm/run/slurmd.pid        #Compute node daemon PID file
ProctrackType=proctrack/pgid                   #How jobs are mapped to processes
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/opt/slurm/log/slurmctld.log  #Controller log file
SlurmdDebug=3
SlurmdLogFile=/opt/slurm/log/slurmd.log        #Compute node log file
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
#NodeName lists the nodes; Procs is the CPU count (see /proc/cpuinfo), and CoresPerSocket,
#ThreadsPerCore and RealMemory (the memory handed to Slurm) can be read from lscpu.
#State=UNKNOWN right after the cluster starts; it changes to idle once the nodes register.
NodeName=manager,node1,node2 Procs=1 State=UNKNOWN
#Partitions are split into control and compute; Default=YES marks the partition jobs go to
#by default, so node1 and node2 form the compute partition with Default=YES.
PartitionName=control Nodes=manager Default=NO MaxTime=INFINITE State=UP
PartitionName=compute Nodes=node1,node2 Default=Yes MaxTime=INFINITE State=UP
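The values on the NodeName line should match the hardware; one way to read them off each node (the grep pattern below is just a convenience, not required):
#CPU topology for Procs/CoresPerSocket/ThreadsPerCore
lscpu | grep -E 'Socket|Core|Thread|^CPU\(s\)'
#Memory in MB, for RealMemory if you choose to set it
free -m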
Distribute the configuration files:
#Copy the configuration from the controller to the compute nodes (run from /opt/slurm)
scp -r etc/ slurm@192.168.231.129:/opt/slurm/
scp -r etc/ slurm@192.168.231.130:/opt/slurm/
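If scp keeps prompting for a password, key-based SSH for the slurm user is an optional convenience (a sketch using the default key path):
#On the controller, as the slurm user: generate a key and push it to the compute nodes
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id slurm@node1
ssh-copy-id slurm@node2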
Start the cluster
#On the controller node, run as root
/opt/slurm/sbin/slurmctld -c
/opt/slurm/sbin/slurmd -c
#On each compute node, run as root
/opt/slurm/sbin/slurmd -c
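Once the daemons are running, the cluster state can be checked from the controller; nodes should move from unknown to idle, and a trivial job across the compute partition confirms dispatch works (paths follow the --prefix above):
#Show partition and node state
/opt/slurm/bin/sinfo
#Run a trivial job on both compute nodes
/opt/slurm/bin/srun -N2 -p compute hostname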