SLURM Array Job

SLURM has an Array Job concept similar to the one in UniScheduler and PBS, but its implementation is a bit unusual, so it is worth walking through.

First, create a shell script that serves as a single job of the Array Job, with the following content:

$ cat test.sh
#!/bin/sh
srun sleep 120

The Array Job can then be submitted with the sbatch command:

sbatch --array=1-10 test.sh
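
Every task submitted this way runs the same script, so in practice the script usually needs to know which task it is. SLURM exposes this through the SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID environment variables. The following is only a minimal sketch of that idea: the task_input helper and the input_N.dat naming scheme are assumptions made for illustration and are not part of the original test.sh.

```shell
#!/bin/sh
# Hypothetical variant of test.sh: each array task selects its own input file.
# SLURM sets SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID inside array tasks;
# the fallback values below exist only so the script also runs outside SLURM.

task_input() {
    # input_N.dat is an assumed naming scheme, not taken from the article
    echo "input_${1}.dat"
}

job_id="${SLURM_ARRAY_JOB_ID:-0}"
task_id="${SLURM_ARRAY_TASK_ID:-1}"

echo "array job ${job_id}, task ${task_id}, input $(task_input "${task_id}")"
```

Submitted with the same "sbatch --array=1-10" command, each of the ten tasks would print a different task number and file name.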

Right after submission, before any task has started running, squeue shows the newly submitted Array Job with the JOBID "465_[1-10]":

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        465_[1-10]    fluent  test.sh  jhadmin PD       0:00      1 (Resources)

Once some tasks have started running, squeue shows that the original "465_[1-10]" entry has become "465_[3-10]", and that two running jobs, "465_1" and "465_2", have appeared:

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        465_[3-10]    fluent  test.sh  jhadmin PD       0:00      1 (Resources)
             465_1    fluent  test.sh  jhadmin  R       1:52      1 mycentos6x
             465_2    fluent  test.sh  jhadmin  R       1:52      1 mycentos6x
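
The bracketed part of the pending JOBID is a compact list of the task IDs that have not started yet. When scripting around squeue it can help to expand that list into single IDs. The sketch below does this in plain POSIX shell; expand_task_spec is a hypothetical helper, not a SLURM command, and it only handles comma-separated values and simple ranges (no step or "%" limit syntax).

```shell
#!/bin/sh
# Expand a task-ID spec such as "3-10" or "6-7,9-10" into single IDs.
# expand_task_spec is a hypothetical helper, not part of SLURM.
expand_task_spec() {
    old_ifs=$IFS
    IFS=,
    out=""
    for part in $1; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;  # a range "lo-hi"
            *)   lo=$part; hi=$part ;;            # a single ID
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="${out:+$out }$i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    echo "$out"
}

expand_task_spec "3-10"
```

For the pending entry shown above, this lists tasks 3 through 10, the eight tasks still waiting for resources.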

The job details reported by "scontrol show job" are as follows:

scontrol show job
JobId=466 ArrayJobId=465 ArrayTaskId=1 JobName=test.sh
   UserId=jhadmin(502) GroupId=jhadmin(502)
   Priority=4294901719 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:09 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-08-14T09:22:57 EligibleTime=2015-08-14T09:22:58
   StartTime=2015-08-14T09:23:51 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=fluent AllocNode:Sid=mycentos6x:17253
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=mycentos6x
   BatchHost=mycentos6x
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jhadmin/test.sh
   WorkDir=/home/jhadmin
   StdErr=/home/jhadmin/slurm-465_1.out
   StdIn=/dev/null
   StdOut=/home/jhadmin/slurm-465_1.out
JobId=467 ArrayJobId=465 ArrayTaskId=2 JobName=test.sh
   UserId=jhadmin(502) GroupId=jhadmin(502)
   Priority=4294901719 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:09 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-08-14T09:22:57 EligibleTime=2015-08-14T09:22:58
   StartTime=2015-08-14T09:23:51 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=fluent AllocNode:Sid=mycentos6x:17253
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=mycentos6x
   BatchHost=mycentos6x
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jhadmin/test.sh
   WorkDir=/home/jhadmin
   StdErr=/home/jhadmin/slurm-465_2.out
   StdIn=/dev/null
   StdOut=/home/jhadmin/slurm-465_2.out
...
JobId=465 ArrayJobId=465 ArrayTaskId=3-10 JobName=test.sh
   UserId=jhadmin(502) GroupId=jhadmin(502)
   Priority=4294901719 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-08-14T09:22:57 EligibleTime=2015-08-14T09:22:58
   StartTime=2016-08-12T09:57:06 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=fluent AllocNode:Sid=mycentos6x:17253
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jhadmin/test.sh
   WorkDir=/home/jhadmin
   StdErr=/home/jhadmin/slurm-465_4294967294.out
   StdIn=/dev/null
   StdOut=/home/jhadmin/slurm-465_4294967294.out
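
In the output above, each running task has its own JobId (466 and 467), while the pending remainder keeps JobId 465 with ArrayTaskId=3-10. To see that mapping at a glance, the header line of each record can be reduced to three columns. This is only a sketch: list_array_tasks is a hypothetical helper, and the sample input is copied from the header lines above rather than read from a live cluster.

```shell
#!/bin/sh
# Print "JobId ArrayJobId ArrayTaskId" for each array-job record.
# list_array_tasks is a hypothetical helper; it reads scontrol-style text on stdin.
list_array_tasks() {
    awk '/^JobId=/ {
        job = ""; arr = ""; task = ""
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "JobId") job = kv[2]
            else if (kv[1] == "ArrayJobId") arr = kv[2]
            else if (kv[1] == "ArrayTaskId") task = kv[2]
        }
        # Skip records that are not part of an array job
        if (arr != "") print job, arr, task
    }'
}

# Sample header lines taken from the output above.
list_array_tasks <<'EOF'
JobId=466 ArrayJobId=465 ArrayTaskId=1 JobName=test.sh
JobId=467 ArrayJobId=465 ArrayTaskId=2 JobName=test.sh
JobId=465 ArrayJobId=465 ArrayTaskId=3-10 JobName=test.sh
EOF
```

On a real system the same function could presumably be fed with "scontrol show job | list_array_tasks".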

After tasks 1-9 have finished and task 10 has started running, squeue shows the following. The original "465_[m-n]" entry has disappeared and only the "465_10" job is left:

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            465_10    fluent  test.sh   kongxx  R       0:53      1 mycentos6x

Now for cancelling jobs.

Cancel the entire array job:

scancel 465

Cancel a single task of the array job:

scancel 465_x
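
The two forms above differ only in whether a task ID follows the array job ID. scancel also accepts a bracketed range of task IDs, such as 465_[6-7]. As a small sketch, the helper below builds the matching command line and prints it instead of running it, so it can be checked first; cancel_cmd is a hypothetical name introduced here, not a SLURM tool.

```shell
#!/bin/sh
# Build (but do not run) a scancel command for an array job.
# cancel_cmd is a hypothetical helper; it only echoes the command line.
cancel_cmd() {
    # $1 = array job ID, $2 = optional task ID or range such as "6-7"
    if [ -z "${2:-}" ]; then
        echo "scancel ${1}"                    # whole array job
    else
        case $2 in
            *[-,]*) echo "scancel ${1}_[${2}]" ;;  # several tasks
            *)      echo "scancel ${1}_${2}" ;;    # one task
        esac
    fi
}

cancel_cmd 465
cancel_cmd 465 8
cancel_cmd 465 6-7
```

When typing the bracketed form by hand, quoting the argument avoids any chance of the shell treating the brackets as a glob pattern.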

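If a task in the middle of the range is cancelled, for example with "scancel 465_8", the output looks like this (the scontrol output below shows array job 552, apparently captured from a separate run):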

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    465_[6-7,9-10]    fluent  test.sh  jhadmin PD       0:00      1 ...
             465_5    fluent  test.sh  jhadmin  R       1:52      1 mycentos6x
scontrol show job
JobId=552 ArrayJobId=552 ArrayTaskId=6-7,9-10 JobName=test.sh 
   UserId=jhadmin(500) GroupId=jhadmin(500)
   Priority=4294901695 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-08-27T11:25:57 EligibleTime=2015-08-27T11:25:58
   StartTime=2015-08-27T12:17:16 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=fluent AllocNode:Sid=mycentos6x:3557
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=mycentos6x
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/apps/test/slurm/test.sh
   WorkDir=/apps/test/slurm
   StdErr=/apps/test/slurm/slurm-552_4294967294.out
   StdIn=/dev/null
   StdOut=/apps/test/slurm/slurm-552_4294967294.out 

When reposting, please cite the address of this article as a link.
Article address: http://blog.csdn.net/kongxx/article/details/48210919