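Notes on building a GPU cluster with slurm. Run every command below as root. The relative `ln -s` commands and the `--with-*=/usr/local/...` flags only line up if the working directory is /usr/local, so start there.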
Installation
Build munge
wget https://github.com/dun/munge/releases/download/munge-0.5.15/munge-0.5.15.tar.xz
tar xvf munge-0.5.15.tar.xz
cd munge-0.5.15
./configure --prefix=/usr/local/munge-0.5.15
# If configure fails, install the crypto development headers first:
# centos: yum install -y libgcrypt-devel
# ubuntu: apt-get install libssl-dev
make -j
make install
cd ..
ln -s munge-0.5.15 munge
Build pmix
# Build hwloc
mkdir hwloc-2 && cd hwloc-2
wget https://download.open-mpi.org/release/hwloc/v2.9/hwloc-2.9.1.tar.gz
tar xvf hwloc-2.9.1.tar.gz
cd hwloc-2.9.1
./configure --prefix=/usr/local/hwloc-2
make -j
make install
cd ../..
ln -s hwloc-2 hwloc
# Build libevent
wget https://github.com/libevent/libevent/releases/download/release-2.1.12-stable/libevent-2.1.12-stable.tar.gz
tar xvf libevent-2.1.12-stable.tar.gz
cd libevent-2.1.12-stable
./configure --prefix=/usr/local/libevent-2.1.12
make -j
make install
cd ..
ln -s libevent-2.1.12 libevent
# Build pmix
mkdir pmix-4 && cd pmix-4
wget https://github.com/openpmix/openpmix/releases/download/v4.2.7/pmix-4.2.7.tar.gz
tar xvf pmix-4.2.7.tar.gz
cd pmix-4.2.7
./configure --prefix=/usr/local/pmix-4 --with-hwloc=/usr/local/hwloc --with-libevent=/usr/local/libevent
make -j
make install
cd ../..
ln -s pmix-4 pmix
Build openmpi
# Build ucx
wget https://github.com/openucx/ucx/releases/download/v1.15.0/ucx-1.15.0.tar.gz
tar xvf ucx-1.15.0.tar.gz
cd ucx-1.15.0
./configure --prefix=/usr/local/ucx-1.15.0
make -j
make install
cd ..
ln -s ucx-1.15.0 ucx
# Build openmpi
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz
tar xvf openmpi-4.1.6.tar.gz
cd openmpi-4.1.6
./configure --prefix=/usr/local/openmpi-4.1.6 --with-pmix=/usr/local/pmix --with-ucx=/usr/local/ucx --with-hwloc=/usr/local/hwloc --with-libevent=/usr/local/libevent
make -j
make install
cd ..
ln -s openmpi-4.1.6 openmpi
Build Slurm
wget https://download.schedmd.com/slurm/slurm-23.02.6.tar.bz2
tar xvf slurm-23.02.6.tar.bz2
cd slurm-23.02.6
./configure --prefix=/usr/local/slurm-23.02.6 --with-pmix=/usr/local/pmix --with-munge=/usr/local/munge --with-hwloc=/usr/local/hwloc --with-ucx=/usr/local/ucx
make -j
make install
cd ..
ln -s slurm-23.02.6 slurm
Add environment variables
See slurm集群安装与踩坑详解 | 我是谁 (yuhldr.github.io).
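As a rough sketch of what those environment variables could look like for the install prefixes used here: the snippet below prepends each package's bin, sbin, and lib directories. The file name /etc/profile.d/slurm.sh is an assumption, not something taken from the referenced article, and directories that do not exist are harmless in PATH.

```shell
# Sketch of a profile snippet, e.g. /etc/profile.d/slurm.sh (the file name is an assumption).
# Each /usr/local/<name> path is one of the symlinks created in the build steps.
for pkg in munge hwloc libevent pmix ucx openmpi slurm; do
    for dir in bin sbin; do
        PATH="/usr/local/$pkg/$dir:$PATH"
    done
    # Append the previous value only when it is non-empty, to avoid a trailing colon.
    LD_LIBRARY_PATH="/usr/local/$pkg/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
done
export PATH LD_LIBRARY_PATH
```

After sourcing it, `command -v slurmd` and `command -v munged` should resolve under /usr/local if the builds succeeded.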
Configuration (2 nodes)
slurm.conf: see slurm.conf (github.com)
gres.conf: see gres.conf (github.com)
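For orientation, here is a minimal sketch of what the two files might contain for this setup. It is not the linked configuration. Only the node names, the partition name, the 8 GPUs per node, and the log paths come from this post; the cluster name, the choice of 8A800-1 as controller, CPUs=64, the select plugin settings, and the /dev/nvidia[0-7] device paths are assumptions to replace with your own values (`slurmd -C` prints the hardware a node actually reports).

```shell
# Sketch only: writes example config files to a scratch directory for review.
conf_dir="$(mktemp -d)"

cat > "$conf_dir/slurm.conf" <<'EOF'
# Assumed values: ClusterName, SlurmctldHost, CPUs, SelectType*
ClusterName=gpu-cluster
SlurmctldHost=8A800-1
SlurmUser=root
AuthType=auth/munge
MpiDefault=pmix
GresTypes=gpu
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=8A800-[1-2] CPUs=64 Gres=gpu:8 State=UNKNOWN
PartitionName=debug Nodes=8A800-[1-2] Default=YES MaxTime=INFINITE State=UP
EOF

cat > "$conf_dir/gres.conf" <<'EOF'
# Assumed device paths for 8 NVIDIA GPUs per node
NodeName=8A800-[1-2] Name=gpu File=/dev/nvidia[0-7]
EOF

echo "example configs written to $conf_dir"
```

With the configure prefix used above, Slurm should look for these files under /usr/local/slurm-23.02.6/etc by default, and every node needs an identical slurm.conf.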
Startup
# control machine: check /var/log/slurmctld.log for errors
munged
munge -n # encode a test credential to confirm munged is answering
slurmctld
slurmd
# node: check /var/log/slurmd.log for errors
munged
munge -n # encode a test credential to confirm munged is answering
slurmd
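Slurm authentication only works when every node runs munged with the same munge.key, so it can help to check credentials end to end before reading the Slurm logs. The helper below is a sketch: `check_munge` is a hypothetical name, and it assumes passwordless ssh as root and that `unmunge` is on the remote PATH.

```shell
# check_munge [host ...]: encode a credential locally, then decode it
# locally and on each remote host given as an argument.
check_munge() {
    local host
    munge -n | unmunge >/dev/null || { echo "local munge check failed" >&2; return 1; }
    for host in "$@"; do
        munge -n | ssh "$host" unmunge >/dev/null \
            || { echo "munge check failed for $host" >&2; return 1; }
    done
    echo "munge ok"
}
```

For example, `check_munge 8A800-2` run on the control machine; a failure on the remote step usually points to a key mismatch or clock skew between the nodes.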
Check
sinfo # if the nodes show up as idle, the configuration works
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 2 idle 8A800-[1-2]
# If a node is stuck in the drain state, set it back to idle (node10 is an example node name)
scontrol update nodename=node10 state=idle
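To spot problem nodes without reading the table by eye, the default `sinfo` columns shown above can be filtered. This is a small sketch; `non_idle_nodes` is a hypothetical helper name and it assumes the default six-column layout (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST).

```shell
# non_idle_nodes: read `sinfo` output on stdin and print "STATE NODELIST"
# for every row whose state is not exactly "idle" (the header row is skipped).
non_idle_nodes() {
    awk 'NR > 1 && $5 != "idle" { print $5, $6 }'
}
```

Usage: `sinfo | non_idle_nodes`. Empty output means every listed node is idle; states with a suffix such as `idle*` are reported too, since they do not match exactly.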
Training a model on the cluster
For example, fine-tuning yi-34b with xtuner on the cluster:
srun -p debug --job-name=xtuner --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --kill-on-bad-exit=1 xtuner train yi_34b_qlora_alpaca_enzh_e3 --launcher slurm
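The same job can also be queued in the background with sbatch rather than held in the foreground by srun. The sketch below writes a batch script that carries the srun options over as #SBATCH directives. The script name xtuner_job.sh and the log file pattern are assumptions; the rest mirrors the command above.

```shell
# Write a batch script equivalent to the srun command above.
job_script="xtuner_job.sh"   # assumed file name
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --job-name=xtuner
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --output=xtuner-%j.log

# srun inherits the allocation described by the #SBATCH lines.
srun --kill-on-bad-exit=1 xtuner train yi_34b_qlora_alpaca_enzh_e3 --launcher slurm
EOF
echo "submit with: sbatch $job_script"
```

Once submitted, `squeue` shows the job state, and the log file collects the output of all tasks (%j expands to the job ID).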