SLURM (Simple Linux Utility for Resource Management) is a highly scalable, fault-tolerant cluster manager and job scheduling system for large clusters of compute nodes.
Query the status of partitions and nodes:
(base) xueruini@nico4:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
V100*        up 1-00:00:00      2  alloc nico[1-2]
Hyb          up 1-00:00:00      1   idle nico3
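sinfo can also give a node-oriented view, which is handy when you care about individual machines rather than partitions; a small sketch (the exact columns shown depend on the site configuration):

sinfo -N -l    # one line per node, long format (CPUS, MEMORY, STATE, REASON, ...)
sinfo -p V100  # limit the listing to the V100 partition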
You may also encounter:
(base) xueruini@nico4:~/onion_rain/pytorch/code/ssd.pytorch$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
V100*        up 1-00:00:00      1  drain nico2
V100*        up 1-00:00:00      1  alloc nico1
Hyb          up 1-00:00:00      1   idle nico3
A node whose STATE is drain cannot be allocated. In that case, the following command shows the reason:
(base) xueruini@nico4:~/onion_rain/pytorch/code/ssd.pytorch$ sinfo -R
REASON               USER      TIMESTAMP            NODELIST
Kill task failed     root      2020-08-18T15:47:15  nico2
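Once the underlying problem has been fixed, the node can be put back into service. This is a sketch of the usual scontrol command for that; it requires administrator (root/SlurmUser) privileges, so regular users normally just report the drained node:

scontrol update NodeName=nico2 State=RESUME   # clear the drain flag (admin only)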
Query node information:
(base) xueruini@nico4:~$ scontrol show node nico1
NodeName=nico1 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=32 CPUTot=32 CPULoad=0.16
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=nico1 NodeHostName=nico1 Version=18.08
   OS=Linux 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26)
   RealMemory=128000 AllocMem=0 FreeMem=215529 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=V100
   BootTime=2020-07-01T21:40:55 SlurmdStartTime=2020-07-01T21:50:13
   CfgTRES=cpu=32,mem=125G,billing=32
   AllocTRES=cpu=32,mem=125G,billing=32
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Query partition information:
(base) xueruini@nico4:~$ scontrol show partition V100
PartitionName=V100
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=nico[1-2]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=64 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Query job status:
(base) xueruini@nico4:~$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    27      V100     bash xueruini  R    1:29:41      1 nico1
    33      V100      zsh   heheda  R      15:26      1 nico2
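squeue can be narrowed down with standard filters; a quick sketch using the user and job id from the listing above:

squeue -u xueruini   # only this user's jobs
squeue -j 27         # only the job with id 27
squeue -l            # long format, including the time limit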
Create an allocation job with salloc (grabbing the resources up front). Common salloc options (a usage sketch follows the list):
--help                            # show help
-A <account>                      # charge the job to the specified account
-D, --chdir=<directory>           # set the working directory
--get-user-env                    # load the current user environment variables
--gres=<list>                     # request generic resources such as GPUs, e.g. --gres=gpu:2 for two GPUs
-J, --job-name=<jobname>          # set the job name
--mail-type=<type>                # send e-mail when the given state occurs; valid types are NONE, BEGIN, END, FAIL, REQUEUE, ALL
--mail-user=<user>                # e-mail address for the notifications
-n, --ntasks=<number>             # number of tasks; sbatch itself does not run the tasks, it only requests the resources needed to run the script; by default each task gets one core, which --cpus-per-task can change
-c, --cpus-per-task=<ncpus>       # cores required per task, default 1
--ntasks-per-node=<ntasks>        # tasks per node; --ntasks has higher priority, and if it is given this becomes the maximum number of tasks per node
-o, --output=<filename pattern>   # file to which the output of the job script is written
-p, --partition=<partition_names> # submit the job to the given partition
-q, --qos=<qos>                   # specify the QOS
-t, --time=<time>                 # set the time limit
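A minimal interactive workflow built from the options above; the partition, GPU count, time limit and job name are illustrative values for this cluster, not requirements:

# reserve 2 GPUs and 8 cores on the V100 partition for 12 hours
salloc -p V100 --gres=gpu:2 -c 8 -t 12:00:00 -J debug
# when the allocation is granted, salloc starts a shell; check which node was assigned
squeue -u xueruini
# ssh to that node (see below) and run your program; exiting the salloc shell
# (or cancelling the job) releases the resources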
Cancel a job:
(base) xueruini@nico4:~$ scancel 28
salloc: Job allocation 28 has been revoked.
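scancel also accepts filters instead of a single job id; a brief sketch (the job name "debug" is an assumed example):

scancel -u xueruini   # cancel all jobs belonging to this user
scancel -n debug      # cancel jobs by job name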
If you have a job running on a node, you can ssh to it:
(base) xueruini@nico4:~$ ssh nico1
Linux nico1 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64
NICO NICO NI ~~~
Welcome to NICO cluster!
Current Nodes: nico[1-4]
Hardware:
  nico1: 8xV100 32G, IB
  nico2: 8xV100 32G, IB
  nico3: 4xV100 32G, 4xP100 (for reproducing results on P100, contact @huangkz before using)
  nico4: 1xP100, 1xGTX1080, 1xRADEON VII (for AMD related research, contact @laekov before using)
Spack is one good west east. We use spack to manage packages.
Use the following command to initialize spack:
  source /opt/spack/share/spack/setup-env.sh
And use the following command to manage packages (environment-module not needed any more):
  spack load openmpi@3.1.2%intel@19  # for example
  spack find --loaded                # list all loaded packages
  spack unload openmpi               # unload currently loaded package
If you have any questions about spack, please do not hesitate to ask YJP.
If the cluster is down, blame Harry Chen.
Last login: Thu Jul 2 12:03:44 2020 from 172.23.18.4
-bash: pyenv: command not found
(base) xueruini@nico1:~$
Without an active job on a node, you cannot ssh to it:
(base) xueruini@nico1:~$ ssh nico2
Access denied: user xueruini (uid=17987) has no active jobs on this node.
Connection closed by 172.23.18.2 port 22