Running Python with Slurm: Study Notes on Slurm

Slurm Workload Manager - Overview

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.

Slurm Workload Manager - Quick Start User Guide

Slurm Workload Manager - Wikipedia

The Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management, or SLURM), or Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

It provides three key functions:

allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,

providing a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes, and

arbitrating contention for resources by managing a queue of pending jobs.

Slurm is the workload manager on about 60% of the TOP500 supercomputers.

Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.

Slurm Workload Manager - sacct
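
sacct queries Slurm's accounting records for running and completed jobs. A couple of illustrative invocations (the job ID and start date are placeholders, not values from the original notes):

$ sacct -j 1234 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
$ sacct -u $USER --starttime=2024-01-01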

sbatch - Submit a batch script to Slurm

$ sbatch mytestsbatch.sh

In the batch script below, the second srun starts only after the first one completes, so the sleep is not actually required (a sketch of running both steps concurrently follows the listing).

# =============================================================================
# mytestscript.sh
# =============================================================================
#!/bin/sh
date &

# =============================================================================
# mytestsbatch.sh
# =============================================================================
#!/bin/sh
#SBATCH -N 2
#SBATCH -n 10

srun -n10 -o testscript1.log mytestscript.sh
sleep 10; srun -n10 -o testscript2.log mytestscript.sh
wait

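Conversely, if the two job steps should run at the same time rather than one after the other, each srun can be put in the background with & and collected with wait. A minimal sketch, assuming the allocation is raised to 20 tasks so both 10-task steps fit at once:

#!/bin/sh
#SBATCH -N 2
#SBATCH -n 20

# Both steps start immediately; wait blocks until both have finished.
srun -n10 -o testscript1.log mytestscript.sh &
srun -n10 -o testscript2.log mytestscript.sh &
wait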

scancel - Used to signal jobs or job steps that are under the control of Slurm.
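A few common scancel forms (the job ID is a placeholder):

$ scancel 1234                      # cancel one job by ID
$ scancel -u $USER                  # cancel all of your own jobs
$ scancel -u $USER --state=PENDING  # cancel only your pending jobs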

scontrol - View or modify Slurm configuration and state.
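Typical scontrol queries (the job ID and node name are placeholders):

$ scontrol show job 1234     # full details of one job
$ scontrol show node host1   # state of a compute node
$ scontrol hold 1234         # hold a pending job
$ scontrol release 1234      # release it again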

squeue - View information about jobs located in the Slurm scheduling queue.
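Common squeue filters:

$ squeue -u $USER    # only your jobs
$ squeue -j 1234     # one specific job (placeholder ID)
$ squeue --start     # expected start times of pending jobs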

srun - Run a parallel job.

$ cat testscript.sh

#!/bin/sh

python mytest.py --arg test

$ chmod +x testscript.sh

$ srun -N5 -n100 testscript.sh

This runs the script on 5 nodes with 100 tasks in total.

$ srun -n5 --nodelist=host1,host2 -o testscript.log testscript.sh

$ srun -n10 -o testscript.log --begin=now+2hour testscript.sh

$ srun --begin=now+10 date &

Convenient SLURM Commands | FAS Research Computing

srun: error: --begin is ignored because nodes are already allocated.

srun: error: Unable to create job step: More processors requested than permitted

In the submission script, you request resources with the #SBATCH directives, and the subsequent calls to srun cannot use more resources than were requested there.
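As a sketch of that limit (this script is illustrative, not from the original notes): with #SBATCH -n 4, a step asking for up to 4 tasks runs, while one asking for 8 triggers the error above.

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 4

srun -n4 hostname    # within the allocation: runs
srun -n8 hostname    # exceeds the allocation: "Unable to create job step: More processors requested than permitted"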
