Introduction
SLURM (Simple Linux Utility for Resource Management)
A highly scalable and fault-tolerant cluster manager and job scheduling system for large clusters of compute nodes.
Commands
Query the status of partitions and nodes:
(base) xueruini@nico4:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
V100* up 1-00:00:00 2 alloc nico[1-2]
Hyb up 1-00:00:00 1 idle nico3
You may encounter:
(base) xueruini@nico4:~/onion_rain/pytorch/code/ssd.pytorch$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
V100* up 1-00:00:00 1 drain nico2
V100* up 1-00:00:00 1 alloc nico1
Hyb up 1-00:00:00 1 idle nico3
Nodes whose STATE is drain cannot be allocated; in that case, the following command shows the reason:
(base) xueruini@nico4:~/onion_rain/pytorch/code/ssd.pytorch$ sinfo -R
REASON USER TIMESTAMP NODELIST
Kill task failed root 2020-08-18T15:47:15 nico2
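To act on this output programmatically, for example to list each drained node together with its reason, `sinfo -R` can be filtered with `awk`. A minimal sketch; the heredoc embeds the sample output from above so it runs anywhere, and on a real cluster you would pipe `sinfo -R` in directly:

```shell
# Print "node: reason" for each entry in sinfo -R style output.
# Live version:  sinfo -R | awk 'NR > 1 { printf ... }'
awk 'NR > 1 { printf "%s: %s %s %s\n", $NF, $1, $2, $3 }' <<'EOF'
REASON USER TIMESTAMP NODELIST
Kill task failed root 2020-08-18T15:47:15 nico2
EOF
```

The node list is always the last field (`$NF`), so the script works even though the reason text itself contains spaces.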
Query node information:
(base) xueruini@nico4:~$ scontrol show node nico1
NodeName=nico1 Arch=x86_64 CoresPerSocket=16
CPUAlloc=32 CPUTot=32 CPULoad=0.16
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=nico1 NodeHostName=nico1 Version=18.08
OS=Linux 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26)
RealMemory=128000 AllocMem=0 FreeMem=215529 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=V100
BootTime=2020-07-01T21:40:55 SlurmdStartTime=2020-07-01T21:50:13
CfgTRES=cpu=32,mem=125G,billing=32
AllocTRES=cpu=32,mem=125G,billing=32
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
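Since the `scontrol` output is a stream of key=value pairs, individual fields are easy to extract. A sketch that pulls the CPU allocation figures out of the sample above; the heredoc stands in for a live `scontrol show node nico1` pipe:

```shell
# Split scontrol's key=value output into one token per line, then
# keep only the CPUAlloc and CPUTot fields.
# Live version:  scontrol show node nico1 | tr ' ' '\n' | grep -E ...
tr ' ' '\n' <<'EOF' | grep -E '^CPU(Alloc|Tot)='
NodeName=nico1 Arch=x86_64 CoresPerSocket=16
CPUAlloc=32 CPUTot=32 CPULoad=0.16
EOF
```

Here CPUAlloc equals CPUTot (32/32), which is consistent with the node's State=ALLOCATED.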
Query partition information:
(base) xueruini@nico4:~$ scontrol show partition V100
PartitionName=V100
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nico[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=64 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Query job status:
(base) xueruini@nico4:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
27 V100 bash xueruini R 1:29:41 1 nico1
33 V100 zsh heheda R 15:26 1 nico2
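In day-to-day use you usually only care about your own jobs: `squeue -u $USER` filters by user, and `-t R` restricts the listing to running jobs. As a sketch of post-processing the output, the following counts running jobs per partition from the sample listing above (the heredoc stands in for a live `squeue` pipe):

```shell
# Count jobs in state R (column 5) per partition (column 2).
# Live version:  squeue | awk 'NR > 1 && $5 == "R" { n[$2]++ } END ...'
awk 'NR > 1 && $5 == "R" { n[$2]++ } END { for (p in n) printf "%s: %d running\n", p, n[p] }' <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
27 V100 bash xueruini R 1:29:41 1 nico1
33 V100 zsh heheda R 15:26 1 nico2
EOF
```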
Create an allocation job (claiming resources):
Common salloc options:
--help
# Show help information;
-A <account>
# Charge the job to the specified account;
-D, --chdir=<directory>
# Set the working directory;
--get-user-env
# Load the current user environment variables;
--gres=<list>
# Request generic resources such as GPUs, e.g. --gres=gpu:2 to request two GPUs;
-J, --job-name=<jobname>
# Set the name of the job;
--mail-type=<type>
# Send an email notification when the given events occur; valid types are NONE, BEGIN, END, FAIL, REQUEUE, ALL;
--mail-user=<user>
# Email address to send notifications to;
-n, --ntasks=<number>
# Number of tasks: resources are requested for this many tasks (the allocation itself does not run them). By default each task gets one core; --cpus-per-task changes this default;
-c, --cpus-per-task=<ncpus>
# Number of cores needed per task, default 1;
--ntasks-per-node=<ntasks>
# Number of tasks per node; --ntasks takes precedence, and if it is given, this value becomes the maximum number of tasks per node;
-o, --output=<filename pattern>
# Output file; the job script's output is written to this file;
-p, --partition=<partition_names>
# Submit the job to the given partition;
-q, --qos=<qos>
# Request the given QOS;
-t, --time=<time>
# Set the time limit;
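Putting several of the options above together, a typical allocation on this cluster might look like the following; the partition name, GPU count, core count, and time limit are illustrative values, not requirements. The sketch only assembles and prints the command so it can run anywhere; on the cluster you would execute it directly:

```shell
# A typical salloc invocation built from the options listed above:
#   -p V100        submit to the V100 partition
#   --gres=gpu:2   request two GPUs
#   -n 1 -c 8      one task with 8 cores
#   -t 12:00:00    12-hour time limit
#   -J test        job name (example value)
cmd="salloc -p V100 --gres=gpu:2 -n 1 -c 8 -t 12:00:00 -J test"
echo "$cmd"
# On the cluster: run the command itself, check the allocation with
# squeue, then ssh to the node that was assigned.
```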
Cancel a job:
(base) xueruini@nico4:~$ scancel 28
salloc: Job allocation 28 has been revoked.
If you have a job running on a node, you can ssh to it:
(base) xueruini@nico4:~$ ssh nico1
Linux nico1 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64
NICO NICO NI ~~~
Welcome to NICO cluster!
Current Nodes: nico[1-4]
Hardware:
nico1: 8xV100 32G, IB
nico2: 8xV100 32G, IB
nico3: 4xV100 32G, 4xP100 (for reproducing results on P100, contact @huangkz before using)
nico4: 1xP100, 1xGTX1080, 1xRADEON VII (for AMD related research, contact @laekov before using)
Spack is one good west east. We use spack to manage packages.
Use the following command to initialize spack:
source /opt/spack/share/spack/setup-env.sh
And use the following command to manage packages (environment-module not needed any more):
spack load openmpi@3.1.2%intel@19 # for example
spack find --loaded # list all loaded packages
spack unload openmpi # unload currently loaded package
If you have any questions about spack, please do not hesitate to ask YJP.
If the cluster is down, blame Harry Chen.
Last login: Thu Jul 2 12:03:44 2020 from 172.23.18.4
-bash: pyenv: command not found
(base) xueruini@nico1:~$
Without a job on the node, ssh access is denied:
(base) xueruini@nico1:~$ ssh nico2
Access denied: user xueruini (uid=17987) has no active jobs on this node.
Connection closed by 172.23.18.2 port 22