Setting Up an Nvidia GPU Cluster with Slurm

This post records the process of building a GPU cluster with Slurm. Note that every command below is run as root.

Installation

Build munge

wget https://github.com/dun/munge/releases/download/munge-0.5.15/munge-0.5.15.tar.xz
tar xvf munge-0.5.15.tar.xz
cd munge-0.5.15
./configure --prefix=/usr/local/munge-0.5.15
# If configure fails for lack of a crypto library:
# centos: yum install -y libgcrypt-devel
# ubuntu: apt-get install -y libgcrypt11-dev libssl-dev
make -j
make install
cd ..
ln -s /usr/local/munge-0.5.15 /usr/local/munge

Build PMIx

# Build hwloc
mkdir hwloc-2 && cd hwloc-2
wget https://download.open-mpi.org/release/hwloc/v2.9/hwloc-2.9.1.tar.gz
tar xvf hwloc-2.9.1.tar.gz 
cd hwloc-2.9.1
./configure --prefix=/usr/local/hwloc-2
make -j
make install
cd ../..
ln -s /usr/local/hwloc-2 /usr/local/hwloc

# Build libevent
wget https://github.com/libevent/libevent/releases/download/release-2.1.12-stable/libevent-2.1.12-stable.tar.gz
tar xvf libevent-2.1.12-stable.tar.gz 
cd libevent-2.1.12-stable
./configure --prefix=/usr/local/libevent-2.1.12
make -j
make install
cd ..
ln -s /usr/local/libevent-2.1.12 /usr/local/libevent

# Build PMIx
mkdir pmix-4 && cd pmix-4
wget https://github.com/openpmix/openpmix/releases/download/v4.2.7/pmix-4.2.7.tar.gz
tar xvf pmix-4.2.7.tar.gz
cd pmix-4.2.7
./configure --prefix=/usr/local/pmix-4 --with-hwloc=/usr/local/hwloc --with-libevent=/usr/local/libevent
make -j
make install
cd ../..
ln -s /usr/local/pmix-4 /usr/local/pmix

Build Open MPI

# Build UCX
wget https://github.com/openucx/ucx/releases/download/v1.15.0/ucx-1.15.0.tar.gz
tar xvf ucx-1.15.0.tar.gz 
cd ucx-1.15.0
./configure --prefix=/usr/local/ucx-1.15.0
make -j
make install
cd ..
ln -s /usr/local/ucx-1.15.0 /usr/local/ucx

# Build Open MPI
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz
tar xvf openmpi-4.1.6.tar.gz
cd openmpi-4.1.6
./configure --prefix=/usr/local/openmpi-4.1.6 --with-pmix=/usr/local/pmix --with-ucx=/usr/local/ucx --with-hwloc=/usr/local/hwloc --with-libevent=/usr/local/libevent
make -j
make install
cd ..
ln -s /usr/local/openmpi-4.1.6 /usr/local/openmpi

Build Slurm

wget https://download.schedmd.com/slurm/slurm-23.02.6.tar.bz2
tar xvf slurm-23.02.6.tar.bz2
cd slurm-23.02.6
./configure --prefix=/usr/local/slurm-23.02.6 --with-pmix=/usr/local/pmix --with-munge=/usr/local/munge --with-hwloc=/usr/local/hwloc --with-ucx=/usr/local/ucx
make -j
make install
cd ..
ln -s /usr/local/slurm-23.02.6 /usr/local/slurm

Add environment variables

For the full list of variables, see: slurm集群安装与踩坑详解 | 我是谁 (yuhldr.github.io)
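A minimal sketch of what to add (e.g. in /etc/profile.d/slurm.sh), assuming the /usr/local symlinks created in the build steps above:

```shell
# Install roots via the symlinks created above (assumption: you used
# the same /usr/local/{munge,pmix,openmpi,slurm} layout).
export MUNGE_HOME=/usr/local/munge
export PMIX_HOME=/usr/local/pmix
export OPENMPI_HOME=/usr/local/openmpi
export SLURM_HOME=/usr/local/slurm

# Binaries and daemons on PATH, shared libraries on LD_LIBRARY_PATH.
export PATH=$SLURM_HOME/bin:$SLURM_HOME/sbin:$MUNGE_HOME/bin:$MUNGE_HOME/sbin:$OPENMPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$SLURM_HOME/lib:$MUNGE_HOME/lib:$PMIX_HOME/lib:$OPENMPI_HOME/lib:$LD_LIBRARY_PATH
```

The file must be present on the control machine and every node, since munged, slurmctld, and slurmd all resolve their libraries through it.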

Configuration (2 nodes)

For slurm.conf, see: slurm.conf (github.com)

For gres.conf, see: gres.conf (github.com)
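In case those links go stale, here is a minimal sketch of the two files for this two-node layout, assuming hostnames 8A800-[1-2] with 8 GPUs each; the CPU count and memory are placeholders to replace with your hardware's actual values:

```ini
# slurm.conf (sketch) -- both files go in /usr/local/slurm/etc
ClusterName=gpu-cluster
SlurmctldHost=8A800-1
SlurmUser=root
MpiDefault=pmix
ProctrackType=proctrack/cgroup
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
GresTypes=gpu
# CPUs/RealMemory below are placeholders
NodeName=8A800-[1-2] Gres=gpu:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# gres.conf (sketch) -- one line covering 8 Nvidia devices per node
Name=gpu File=/dev/nvidia[0-7]
```

Both files must be identical on the control machine and every node.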

Start

# On the control machine; check /var/log/slurmctld.log for errors
munged
munge -n | unmunge   # verify the munge credential round-trips
slurmctld
slurmd

# On each compute node; check /var/log/slurmd.log for errors
munged
munge -n | unmunge   # verify the munge credential round-trips
slurmd

Verify

sinfo # if the nodes are in STATE idle, the setup succeeded
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2    idle 8A800-[1-2]

# If a node is stuck in the "drain" state, "undrain" it:
scontrol update nodename=node10 state=idle

Training a model on the cluster

As an example, use xtuner to fine-tune Yi-34B on the cluster:

srun -p debug --job-name=xtuner --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --kill-on-bad-exit=1 xtuner train yi_34b_qlora_alpaca_enzh_e3 --launcher slurm
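The same job can also be submitted asynchronously as a batch script; this is a sketch in which the #SBATCH directives mirror the srun flags above:

```shell
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --job-name=xtuner
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --kill-on-bad-exit=1

# One task per GPU across both nodes; xtuner picks up the task layout
# from the Slurm environment when --launcher slurm is given.
srun xtuner train yi_34b_qlora_alpaca_enzh_e3 --launcher slurm
```

Submit it with `sbatch train.sh`; output goes to `slurm-<jobid>.out` in the submission directory by default.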

References

  1. slurm集群安装与踩坑详解 | 我是谁 (yuhldr.github.io)

  2. Slurm 20.02.3 集群添加gpu节点 No. 2-1 (CSDN)

  3. Slurm | NVIDIA Developer

  4. How to "undrain" slurm nodes in drain state - Stack Overflow

  5. Slurm Workload Manager - gres.conf (schedmd.com)

  6. Slurm Workload Manager - Generic Resource (GRES) Scheduling (schedmd.com)
