1. Introduction
1.1 What is a cluster?
2. Connecting to the system
3. Command line interface
4. Software environment
5. Data management
6. Job Management
6.1 Partitions and Limits
6.2 Resource allocation: Jobs and jobscripts
For each job, the user has to specify which resources the job needs in order
to run properly. The following resources can be requested:
- A number of nodes/computers
- A number of CPU cores per node
- An amount of memory per core or node
- The amount of time the job needs
- Special hardware classes/partitions
- Special features like GPUs
High resource usage will lower your priority for new jobs. A period of low activity will cause your priority for new jobs to increase again.
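So not every job should request the most powerful hardware: it is unnecessary and only means a longer wait in the queue. A period of low activity lets the priority of your new jobs rise again, which keeps the system from being used by one person all the time.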
SLURM
This is the name of the resource scheduler; see: SLURM workload manager.
Job scripts
Lines in a job script that start with #SBATCH are interpreted by the SLURM workload manager.
This means that jobs need to be specified according to SLURM syntax rules
and that the SLURM commands have to be used.
Next, specify the required nodes, cores, memory, and time.
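The resource requests described above can be sketched as a job script header. The Python snippet below renders one #SBATCH line per option. The option names (--nodes, --ntasks, --cpus-per-task, --mem, --time, --partition) are standard sbatch options; the helper name render_sbatch_header and all example values, including the partition name "gpu", are assumptions for illustration and have to be adapted to the actual cluster.

```python
def render_sbatch_header(options):
    """Return the top of a bash job script with one #SBATCH line per option.

    `options` maps sbatch long option names (without the leading dashes)
    to values. Hypothetical helper, not part of SLURM.
    """
    lines = ["#!/bin/bash"]
    for name, value in options.items():
        lines.append(f"#SBATCH --{name}={value}")
    return "\n".join(lines)


# Example values are assumptions; check the limits of your own cluster.
header = render_sbatch_header({
    "nodes": 1,
    "ntasks": 1,
    "cpus-per-task": 4,
    "mem": "8G",
    "time": "02:00:00",
    "partition": "gpu",  # hypothetical partition name
})
print(header)
```

The printed header would be followed in the job script by the commands to run, for example module load lines and the program itself.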
Time
- minutes
- minutes:seconds
- hours:minutes:seconds
- days-hours
- days-hours:minutes
- days-hours:minutes:seconds
If the requested time runs out, the job is terminated even if it has not finished, regardless of its state.
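To make the accepted time formats concrete, here is a small Python sketch that converts a time limit string into seconds. It assumes the six SLURM formats (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds); the function name slurm_time_to_seconds is my own and not part of SLURM.

```python
def slurm_time_to_seconds(spec):
    """Convert a SLURM time limit string to a number of seconds."""
    days = 0
    rest = spec
    if "-" in spec:
        # A dash separates the day count from hours[:minutes[:seconds]].
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unrecognised time format: {spec!r}")
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in rest.split(":")]
        if len(fields) == 1:      # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:    # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:    # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unrecognised time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(slurm_time_to_seconds("30"))        # 30 minutes
print(slurm_time_to_seconds("02:00:00"))  # 2 hours
print(slurm_time_to_seconds("1-12"))      # 1 day and 12 hours
```

Note that a bare number means minutes, not seconds or hours, which is easy to get wrong when requesting time.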
Wall clock time is the normal time that passes and can be measured, for example, from two readings of a wall clock. CPU time is the part of this time that a CPU spends on calculations.
On a single CPU, the CPU time is therefore generally shorter than the wall clock time.
To make things more complex, a program can make use of multiple CPU cores. In that case the CPU time is accumulated over all these CPU cores, so it normally progresses much faster than on a single CPU, and much faster than the wall clock time that passes in the same period.
With multiple CPUs the total CPU time can therefore exceed the wall clock time, even though the parallel speedup makes the job itself finish sooner.
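The difference between the two clocks can be seen with a small Python sketch. It uses time.perf_counter for wall clock time and time.process_time for the CPU time of the current process; the helper names measure and accumulated_cpu_seconds are made up for this illustration, and the multi-core part is plain arithmetic under the assumption that every core is fully busy.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping lets wall clock time pass while the CPU does almost no work.
wall, cpu = measure(lambda: time.sleep(0.5))
print(f"sleep: wall={wall:.2f}s cpu={cpu:.2f}s")


def accumulated_cpu_seconds(wall_seconds, busy_cores):
    """CPU time accumulated when `busy_cores` cores are fully busy."""
    return wall_seconds * busy_cores


# Four fully busy cores for 60 s of wall time accumulate 240 s of CPU time.
print(accumulated_cpu_seconds(60, 4))
```

The time limit requested for a job is a wall clock limit, so the first number is the one that matters when choosing it.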
Nodes and cores
cpus-per-task
Used for multithreaded programs, e.g. MATLAB.
How these numbers should be set depends on the capabilities of the program you want to run.
For a given program, normally only one of the two applies; treat them as alternatives.
The number of tasks and/or CPUs per task may be set higher than 1 only for programs that can use multiple CPU cores.
The number of nodes can only be higher than 1 for software that is capable of running on multiple physical computers, using network communication.
Very Important
If you do not know whether your program runs in parallel, do not request multiple cores or nodes.
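Requesting cores only helps if the program actually uses them. One way to keep a Python program in step with the request is to read the allocation from the job environment, as in the sketch below. SLURM_CPUS_PER_TASK is set by SLURM only when --cpus-per-task was requested; the function name allocated_cpus and the fallback to one core are my own assumptions.

```python
import os


def allocated_cpus(env=None):
    """Return the number of CPU cores SLURM allocated per task.

    SLURM sets SLURM_CPUS_PER_TASK only when --cpus-per-task was requested;
    outside a job, or without that option, fall back to a single core.
    """
    env = os.environ if env is None else env
    return int(env.get("SLURM_CPUS_PER_TASK", "1"))


workers = allocated_cpus()
print(f"using {workers} worker process(es)")
```

The resulting number could then be passed to whatever sets the degree of parallelism in the program, for example the size of a worker pool.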
6.3 Running jobs
6.3.1 Submitting jobs
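Put simply, options for the job script can also be given on the command line when submitting. The advantage is that the script does not have to be changed; the drawback is that there is hardly any record of what was changed.
If an option is given both on the command line and in the script, the command line takes precedence and overrides the value in the script.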
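A small Python sketch of such a submission with command-line overrides follows. It only builds and prints the sbatch argument list; the helper name build_sbatch_command and the file name jobscript.sh are hypothetical, while --time and --cpus-per-task are standard sbatch options.

```python
def build_sbatch_command(script, **overrides):
    """Build the argument list for an sbatch call with command-line overrides.

    Keyword names use underscores in place of dashes, e.g. cpus_per_task.
    Options go before the script name; anything after it is passed to the
    script itself as an argument.
    """
    command = ["sbatch"]
    for name, value in overrides.items():
        command.append(f"--{name.replace('_', '-')}={value}")
    command.append(script)
    return command


# jobscript.sh is a hypothetical file that sets its own #SBATCH values.
command = build_sbatch_command("jobscript.sh", time="00:30:00", cpus_per_task=2)
print(" ".join(command))
# To submit for real on a cluster: subprocess.run(command, check=True)
```

Printing or logging the full command before submitting is one way to keep the record of overrides that would otherwise be lost.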
6.3.2 Job environment
Jobs will always start in the directory from which they were submitted. Furthermore, your complete environment at job submission will be stored and transferred to the job; this includes all loaded modules, meaning the modules loaded one by one in the terminal with module load module-***.
Not everything has to be transferred, though. For that there is #SBATCH --export=var1,var2,...,varN, which names the variables that should actually be exported.
Special values that can be given are ALL (default) and NONE. By using the latter, the job will start with a clean/empty environment.
As shown above, without module load Miniconda3 no modules are loaded at all. This means we are using a pure, clean jd_ReID environment.
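Whether a job really starts with a clean environment can be checked from inside the job. The sketch below assumes the module system exports the colon-separated LOADEDMODULES variable, as Environment Modules and Lmod do; the function name loaded_modules is my own.

```python
import os


def loaded_modules(env=None):
    """Return the module names listed in LOADEDMODULES, or [] if none."""
    env = os.environ if env is None else env
    value = env.get("LOADEDMODULES", "")
    return [name for name in value.split(":") if name]


modules = loaded_modules()
print("loaded modules:", modules if modules else "none")
```

Running this at the start of a job gives a quick record of which modules came along from the submission environment.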
6.4 Cancelling jobs
6.5
6.6
7. Examples
7.1
7.2
7.3
7.4
7.5
7.6
7.7 Python
7.7.1 Submitting single jobs
Single CPU
Multiple CPUs
GPU
Multiple Nodes
0. Node categories and properties
1. Module environment
The module system has been installed to help you set up the correct environment for the different software packages.
This also allows the user to select a specific version of a software package.
For example, if I want to use Python but need a particular version of PyTorch and NumPy, the module system is how I select them.
Module command
The environment can be set using the “module” command (ml is shorthand for module). Some useful options for the command are:
- avail: list the available software modules
More specific usage of this option:
module avail
This gives a full overview of the available software.
module avail boost
If you are only interested in certain software, you can use a keyword. Any software matching this string will be shown.
module avail pytorch
This does turn up several PyTorch modules, but which one should be used?
If only PyTorch is given, the default is presumably used: PyTorch/1.3.1-Python-3.7.4
Adding a module to the environment
In order to add a module to your environment you can use the option load or add. In order to load Python 3.5.1 into your environment you can use:
module load Python/3.5.1-foss-2016a
Since 3.5.1-foss-2016a is the default version, this can be shortened to:
module load Python
The following loads the PyTorch module I want to use:
module load PyTorch/1.3.1-fosscuda-2019b-Python-3.7.4
All modules on which this module depends will also be loaded. This is very convenient: the packages PyTorch depends on are loaded automatically.
Checking which modules are in the environment
To show which modules are currently in your environment the option list can be used:
module list
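2. How to train on multiple GPUs?
I looked at the wiki, but the wiki page only shows a few sh files. I will look into this the next time I actually need multiple GPUs.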