Switch to the Tsinghua mirror
a) Ubuntu mirror usage help, Tsinghua Open Source Mirror
i. https://mirrors.tuna.tsinghua.edu.cn/help/ubuntu/
apt update
Install munge
apt install libmunge-dev libmunge2 munge
Start the service
/etc/init.d/munge start
Afterwards it can be managed through systemctl: systemctl status munge
Check connectivity
Configure SSH keys on all nodes
munge -n | unmunge
munge -n | ssh 192.168.0.65 unmunge // must succeed against every node
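The per-node check above can be wrapped in a small loop. This is a minimal sketch, assuming passwordless SSH is already set up and that every node carries the same /etc/munge/munge.key; the function name check_munge_nodes and the host list are illustrative and not part of the original notes.

```shell
#!/bin/sh
# check_munge_nodes (hypothetical helper): encode a credential locally and
# ask each host given as an argument to decode it over SSH.
# Prints one OK/FAIL line per host and returns non-zero if any host failed.
check_munge_nodes() {
    failed=0
    for host in "$@"; do
        # The pipeline's exit status is that of ssh, i.e. of the remote unmunge.
        if munge -n | ssh "$host" unmunge >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example call (commented out; the addresses are placeholders):
# check_munge_nodes 192.168.0.65 192.168.0.66
```

A FAIL line usually points at a key mismatch, clock skew, or a munge daemon that is not running on that host.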
Install Slurm
apt-get install binutils
useradd slurm
apt-get install build-essential
wget https://download.schedmd.com/slurm/slurm-21.08.8-2.tar.bz2
tar -jxvf slurm-21.08.8-2.tar.bz2
cd slurm-21.08.8-2
Build: ./configure --prefix=/usr/local/slurm && make && make install
PATH: echo 'export PATH=$PATH:/usr/local/slurm/bin:/usr/local/slurm/sbin' >> /root/.bashrc
Configure the runtime dynamic library path:
ldconfig -n /usr/local/slurm/lib
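The PATH step above appends a line to /root/.bashrc every time it is run, so repeating the setup leaves duplicates. A minimal sketch of an idempotent variant follows; append_once is a hypothetical helper name, not something Slurm provides.

```shell
#!/bin/sh
# append_once (hypothetical helper): append LINE to FILE unless an
# identical line is already present, so re-running the setup is harmless.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example call (commented out). Single quotes keep $PATH literal in the file,
# so it is expanded when the shell starts rather than when the line is written.
# append_once 'export PATH=$PATH:/usr/local/slurm/bin:/usr/local/slurm/sbin' /root/.bashrc
```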
Create the configuration files
mkdir /usr/local/slurm/etc
cp /root/slurm/slurm-21.08.8-2/etc/cgroup.conf.example /usr/local/slurm/etc/cgroup.conf
cp /root/slurm/slurm-21.08.8-2/etc/slurm.conf.example /usr/local/slurm/etc/slurm.conf // edit the node configuration yourself; every machine needs the same configuration
Optional: https://slurm.schedmd.com/configurator.html is an online slurm.conf generator and can be used to build the configuration.
Key configuration:
Configure gres.conf // set it according to the GPU model.
root@0-65:/usr/local/slurm/etc# cat gres.conf
NodeName=0-65 Name=gpu File=/dev/nvidia0 Type=1660
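For a node with several GPUs, gres.conf needs one such line per device file. The sketch below prints those lines from a node name, a type label, and a list of device paths; gen_gres_conf is a hypothetical helper, and the second device path in the example is an assumption about the machine. slurm.conf typically also needs GresTypes=gpu and a matching Gres= entry on the node line.

```shell
#!/bin/sh
# gen_gres_conf (hypothetical helper): print one gres.conf line per GPU
# device file. Arguments: node name, GPU type label, then device paths.
gen_gres_conf() {
    node="$1"
    gpu_type="$2"
    shift 2
    for dev in "$@"; do
        printf 'NodeName=%s Name=gpu File=%s Type=%s\n' "$node" "$dev" "$gpu_type"
    done
}

# Example call (commented out; /dev/nvidia1 is assumed to exist on the node):
# gen_gres_conf 0-65 1660 /dev/nvidia0 /dev/nvidia1 > /usr/local/slurm/etc/gres.conf
```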
Plugin management
mkdir /usr/local/slurm/etc/plugstack.conf.d
echo "include /usr/local/slurm/etc/plugstack.conf.d/*.conf" >> /usr/local/slurm/etc/plugstack.conf
// copy the etc configuration files to every machine
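Copying the etc directory to every machine can be scripted. This is a rough sketch, assuming root SSH access to each node and that /usr/local/slurm already exists there from the build step; push_slurm_etc is a hypothetical helper name.

```shell
#!/bin/sh
# push_slurm_etc (hypothetical helper): copy the local Slurm etc directory
# to the same location on each host given as an argument.
push_slurm_etc() {
    etc_dir=/usr/local/slurm/etc
    for host in "$@"; do
        # -r copies the directory recursively, including plugstack.conf.d
        scp -r "$etc_dir" "root@${host}:/usr/local/slurm/" || return 1
        echo "copied to $host"
    done
}

# Example call (commented out; the address is a placeholder):
# push_slurm_etc 192.168.0.66
```

Note that scp -r follows symbolic links, so any linked file arrives on the remote side as a regular copy.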
Configure systemd and related settings
chmod 777 /var/spool/
cp /root/slurm/slurm-21.08.8-2/etc/*.service /etc/systemd/system/
systemctl daemon-reload
systemctl restart slurmd slurmctld.service
source /root/.bashrc
scontrol show node
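The command should print the node details. IDLE means the node is available. A node may show as down; updating the node state fixes that.
scontrol update nodename=[node_name] state=resume // roughly like this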
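When several nodes come up as down, the resume command can be applied to all of them in one pass. A minimal sketch, assuming the cause of the down state has already been fixed; resume_down_nodes is a hypothetical helper name.

```shell
#!/bin/sh
# resume_down_nodes (hypothetical helper): list the nodes that sinfo reports
# as down and ask scontrol to return each one to service.
resume_down_nodes() {
    # -h: no header, -N: one line per node, -t down: filter by state,
    # -o '%N': print only the node name. sort -u drops duplicates that
    # appear when a node belongs to more than one partition.
    sinfo -h -N -t down -o '%N' | sort -u | while read -r node; do
        [ -n "$node" ] || continue
        scontrol update nodename="$node" state=resume && echo "resumed $node"
    done
}

# Example call (commented out):
# resume_down_nodes
```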
Enroot
wget https://github.com/NVIDIA/enroot/releases/download/v3.1.0/enroot+caps_3.1.0-1_amd64.deb
wget https://github.com/NVIDIA/enroot/releases/download/v3.1.0/enroot_3.1.0-1_amd64.deb
apt install zstd jq
chmod a+r ./*.deb
sudo apt install -y ./*.deb // or: dpkg -i *.deb
apt install -y git gcc make libcap2-bin libtool automake libmd-dev
apt install -y curl gawk jq squashfs-tools parallel
apt install -y fuse-overlayfs libnvidia-container-tools pigz squashfuse # optional
enroot version // must be 3.1.0
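Since the version matters here, the check can be made explicit on each node. This sketch assumes `enroot version` prints the bare version string; require_enroot_version is a hypothetical helper name.

```shell
#!/bin/sh
# require_enroot_version (hypothetical helper): succeed only when
# `enroot version` prints exactly the expected version string.
require_enroot_version() {
    want="$1"
    have=$(enroot version 2>/dev/null) || { echo "enroot is not runnable"; return 1; }
    if [ "$have" = "$want" ]; then
        echo "enroot $have"
    else
        echo "enroot $have found, $want expected"
        return 1
    fi
}

# Example call (commented out):
# require_enroot_version 3.1.0
```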
pyxis
tar -zxvf pyxis-0.13.0.tar.gz
cd pyxis-0.13.0/
make
make install
ln -s /usr/local/share/pyxis/pyxis.conf /usr/local/slurm/etc/plugstack.conf.d/
Restart slurmd and slurmctld
systemctl restart slurmd slurmctld
Check whether srun --help lists any image options
srun --help | grep image
If it does, pyxis is installed correctly.
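The same check can be turned into a yes/no test for use in a setup script. A minimal sketch that looks for the --container-image option pyxis adds to srun; pyxis_loaded is a hypothetical helper name.

```shell
#!/bin/sh
# pyxis_loaded (hypothetical helper): return 0 when srun advertises the
# --container-image option, which is only present when the pyxis plugin loads.
pyxis_loaded() {
    # "--" stops grep from treating the pattern as one of its own options.
    srun --help 2>&1 | grep -q -- '--container-image'
}

# Example call (commented out):
# pyxis_loaded && echo "pyxis OK" || echo "pyxis missing, check plugstack.conf"
```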
Test
srun grep PRETTY /etc/os-release
//PRETTY_NAME="Ubuntu 20.04.2 LTS"
srun --container-image=centos grep PRETTY /etc/os-release // this step requires docker; docker must be installed on every node
//PRETTY_NAME="CentOS Linux 8"
Slurm nodes need nvidia-container-runtime installed // any package source will do; if docker can start, it is generally fine.
vim /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "experimental": true
}
systemctl restart docker
systemctl restart slurmd
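After the restart it is worth confirming that docker actually picked up the new default runtime. A small sketch using the Go-template output of `docker info`; docker_default_runtime_is is a hypothetical helper name.

```shell
#!/bin/sh
# docker_default_runtime_is (hypothetical helper): return 0 when the docker
# daemon reports the given name as its default runtime.
docker_default_runtime_is() {
    want="$1"
    # --format selects a single field from the daemon's info structure.
    have=$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null) || return 1
    [ "$have" = "$want" ]
}

# Example call (commented out):
# docker_default_runtime_is nvidia || echo "daemon.json was not applied"
```

If the check fails, a syntax error in /etc/docker/daemon.json is a common cause, and the docker service log will usually say so.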
Issues encountered during testing
Pyxis reports that /run/pyxis does not have enough space
Change the pyxis image storage directory
Restart slurmd
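The notes do not say how the storage directory was changed. One possible approach, offered as an assumption to verify against the pyxis documentation for the installed version, is the runtime_path argument on the line that loads spank_pyxis.so in the pyxis configuration file. The helper name set_pyxis_runtime_path and the /data/pyxis directory below are hypothetical.

```shell
#!/bin/sh
# set_pyxis_runtime_path (hypothetical helper): set runtime_path=DIR on the
# plugstack line that loads spank_pyxis.so. Assumes GNU sed (for -i) and
# that the installed pyxis accepts a runtime_path argument.
set_pyxis_runtime_path() {
    conf="$1"
    new_dir="$2"
    # First drop any existing runtime_path=... argument on the pyxis line,
    # then append the new value to the end of that line.
    sed -i \
        -e '/spank_pyxis\.so/ s#[[:space:]]*runtime_path=[^[:space:]]*##' \
        -e "/spank_pyxis\.so/ s#\$# runtime_path=${new_dir}#" \
        "$conf"
}

# Example call (commented out). Edit the real file rather than the symlink in
# plugstack.conf.d, because sed -i would replace the link with a regular file.
# mkdir -p /data/pyxis
# set_pyxis_runtime_path /usr/local/share/pyxis/pyxis.conf /data/pyxis
# systemctl restart slurmd
```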