A Docker-Based Slurm Job Management System

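Alibaba Cloud Server Setup

Reference video: https://www.bilibili.com/video/BV177411K7bH

Step 1 - Apply for an Alibaba Cloud server

Alibaba Cloud offers a one-month free trial of its cloud hosts. I applied for a one-month instance with 1 vCPU and 2 GB of RAM, 4 Mbps of bandwidth, and a 40 GB system disk, running 64-bit CentOS 8.4.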

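Step 2 - Modify the instance

Open the ECS (Elastic Compute Service) console and click the running instance, i-uf689okdsil887t0h11x. The instance page shows the server's public IP address, which is needed later for the SSH login.
Next, change the instance hostname and reset the instance password, then restart the instance right away so the changes take effect.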

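Step 3 - Configure the security group and open the required ports

A cloud server purchased on Alibaba Cloud needs its security group rules configured; otherwise it cannot be reached from outside.
Click Configure Rules in the Actions column to enter the security group, then add the ports you need. The final example uses port 8888, so be sure to open it.
Port 22 is open by default (SSH uses it later). Beyond the ports I have added so far, you can come back here and open more whenever they are needed.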
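
Once a rule is in place, a quick TCP probe from your own machine can confirm that the port is reachable. The sketch below is a minimal illustration, not part of the original setup: the function name port_is_open is made up for this example, and you would pass your own public IP address.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection completes the TCP handshake, so success means the
        # security group lets the traffic in AND a service is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("203.0.113.10", 22) probes SSH on a placeholder documentation address. A False result for port 8888 at this stage does not necessarily mean the rule is wrong, since nothing listens on that port until the cluster is started.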

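Step 4 - Connect remotely with Xshell

Download Xshell 7 from the official website and install it. Create a new session, enter your Alibaba Cloud public IP address, then enter the username root and the password you just set for the server. When you see "Welcome to Alibaba Cloud Elastic Compute Service!", you are logged in to the server.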

# Activate the web console (Cockpit), as the login banner suggests
[root@Iceland ~]# systemctl enable --now cockpit.socket
Created symlink /etc/systemd/system/sockets.target.wants/cockpit.socket → /usr/lib/systemd/system/cockpit.socket.
# Check the server's current environment
[root@Iceland ~]# pwd
/root
[root@Iceland ~]# cd /
[root@Iceland /]# ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
[root@Iceland ~]# uname -r  # show the kernel version
4.18.0-305.3.1.el8.x86_64
[root@Iceland /]# cat /etc/os-release  # show detailed OS information
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"

Installing Docker on the Server

Official documentation: https://docs.docker.com/engine/install/centos/

Step 1 - Remove old Docker versions

[root@Iceland /]# sudo yum remove docker \
> docker-client \
> docker-client-latest \
> docker-common \
> docker-latest \
> docker-latest-logrotate \
> docker-logrotate \
> docker-engine
No match for argument: docker
No match for argument: docker-client
No match for argument: docker-client-latest
No match for argument: docker-common
No match for argument: docker-latest
No match for argument: docker-latest-logrotate
No match for argument: docker-logrotate
No match for argument: docker-engine
No packages marked for removal.
Dependencies resolved.
Nothing to do.
Complete!  # this is a fresh server, so none of the old Docker packages were present

Step 2 - Set up the Docker yum repository

[root@Iceland /]# yum install -y yum-utils  # yum-utils provides yum-config-manager
Last metadata expiration check: 2:09:26 ago on Sat 28 Aug 2021 06:38:17 PM CST.
Dependencies resolved.
=============================================================================================
 Package               Architecture       Version                   Repository          Size
=============================================================================================
Installing:
 yum-utils             noarch             4.0.18-4.el8              baseos              71 k

Transaction Summary
=============================================================================================
Install  1 Package

Total download size: 71 k
Installed size: 22 k
Downloading Packages:
yum-utils-4.0.18-4.el8.noarch.rpm                            1.7 MB/s |  71 kB     00:00    
---------------------------------------------------------------------------------------------
Total                                                        1.6 MB/s |  71 kB     00:00     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                     1/1 
  Installing       : yum-utils-4.0.18-4.el8.noarch                                       1/1 
  Running scriptlet: yum-utils-4.0.18-4.el8.noarch                                       1/1 
  Verifying        : yum-utils-4.0.18-4.el8.noarch                                       1/1 

Installed:
  yum-utils-4.0.18-4.el8.noarch                                                              

Complete!
# Add Docker's stable repository
[root@Iceland /]# yum-config-manager \
>     --add-repo \
>     https://download.docker.com/linux/centos/docker-ce.repo
Adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
# This overseas site is slow; an Alibaba Cloud mirror is configured later

Step 3 - Install Docker Engine

[root@Iceland /]# yum install docker-ce docker-ce-cli containerd.io  # install these three components
Docker CE Stable - x86_64                                             18 kB/s |  15 kB     00:00    
Dependencies resolved.
=====================================================================================================
 Package                    Arch    Version                                  Repository         Size
=====================================================================================================
Installing:
 containerd.io              x86_64  1.4.9-3.1.el8                            docker-ce-stable   30 M
 docker-ce                  x86_64  3:20.10.8-3.el8                          docker-ce-stable   22 M
 docker-ce-cli              x86_64  1:20.10.8-3.el8                          docker-ce-stable   29 M
Installing dependencies:
 container-selinux          noarch  2:2.164.1-1.module_el8.4.0+886+c9a8d9ad  appstream          52 k
 docker-ce-rootless-extras  x86_64  20.10.8-3.el8                            docker-ce-stable  4.6 M
 docker-scan-plugin         x86_64  0.8.0-3.el8                              docker-ce-stable  4.2 M
 fuse-common                x86_64  3.2.1-12.el8                             baseos             21 k
 fuse-overlayfs             x86_64  1.6-1.module_el8.4.0+886+c9a8d9ad        appstream          73 k
 fuse3                      x86_64  3.2.1-12.el8                             baseos             50 k
 fuse3-libs                 x86_64  3.2.1-12.el8                             baseos             94 k
 libcgroup                  x86_64  0.41-19.el8                              baseos             70 k
 libslirp                   x86_64  4.3.1-1.module_el8.4.0+575+63b40ad7      appstream          69 k
 slirp4netns                x86_64  1.1.8-1.module_el8.4.0+641+6116a774      appstream          51 k
Enabling module streams:
 container-tools                    rhel8                                                           

Transaction Summary
=====================================================================================================
Install  13 Packages

Total download size: 90 M
Installed size: 377 M
Is this ok [y/N]: y  # type y when prompted
Downloading Packages:
(1/13): container-selinux-2.164.1-1.module_el8.4.0+886+c9a8d9ad.noar 1.4 MB/s |  52 kB     00:00    
(2/13): fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64.rpm  1.9 MB/s |  73 kB     00:00    
(3/13): libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64.rpm      1.3 MB/s |  69 kB     00:00    
(4/13): fuse-common-3.2.1-12.el8.x86_64.rpm                          1.4 MB/s |  21 kB     00:00    
(5/13): slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64.rpm   2.7 MB/s |  51 kB     00:00    
(6/13): fuse3-3.2.1-12.el8.x86_64.rpm                                4.0 MB/s |  50 kB     00:00    
(7/13): libcgroup-0.41-19.el8.x86_64.rpm                             4.6 MB/s |  70 kB     00:00    
(8/13): fuse3-libs-3.2.1-12.el8.x86_64.rpm                           4.7 MB/s |  94 kB     00:00    
(9/13): docker-ce-20.10.8-3.el8.x86_64.rpm                           5.5 MB/s |  22 MB     00:03    
(10/13): docker-ce-rootless-extras-20.10.8-3.el8.x86_64.rpm          3.5 MB/s | 4.6 MB     00:01    
(11/13): containerd.io-1.4.9-3.1.el8.x86_64.rpm                      4.7 MB/s |  30 MB     00:06    
(12/13): docker-scan-plugin-0.8.0-3.el8.x86_64.rpm                   3.5 MB/s | 4.2 MB     00:01    
(13/13): docker-ce-cli-20.10.8-3.el8.x86_64.rpm                      3.6 MB/s |  29 MB     00:08    
-----------------------------------------------------------------------------------------------------
Total                                                                 11 MB/s |  90 MB     00:08     
warning: /var/cache/dnf/docker-ce-stable-fa9dc42ab4cec2f4/packages/containerd.io-1.4.9-3.1.el8.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Docker CE Stable - x86_64                                            3.1 kB/s | 1.6 kB     00:00    
Importing GPG key 0x621E9F35:
 Userid     : "Docker Release (CE rpm) <docker@docker.com>"
 Fingerprint: 060A 61C5 1B55 8A7F 742B 77AA C52F EB6B 621E 9F35
 From       : https://download.docker.com/linux/centos/gpg
Is this ok [y/N]: y  # type y when prompted
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                             1/1 
  Installing       : docker-scan-plugin-0.8.0-3.el8.x86_64                                      1/13 
  Running scriptlet: docker-scan-plugin-0.8.0-3.el8.x86_64                                      1/13 
  Installing       : docker-ce-cli-1:20.10.8-3.el8.x86_64                                       2/13 
  Running scriptlet: docker-ce-cli-1:20.10.8-3.el8.x86_64                                       2/13 
  Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           3/13 
  Installing       : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           3/13 
  Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           3/13 
  Installing       : containerd.io-1.4.9-3.1.el8.x86_64                                         4/13 
  Running scriptlet: containerd.io-1.4.9-3.1.el8.x86_64                                         4/13 
  Running scriptlet: libcgroup-0.41-19.el8.x86_64                                               5/13 
  Installing       : libcgroup-0.41-19.el8.x86_64                                               5/13 
  Running scriptlet: libcgroup-0.41-19.el8.x86_64                                               5/13 
  Installing       : fuse3-libs-3.2.1-12.el8.x86_64                                             6/13 
  Running scriptlet: fuse3-libs-3.2.1-12.el8.x86_64                                             6/13 
  Installing       : fuse-common-3.2.1-12.el8.x86_64                                            7/13 
  Installing       : fuse3-3.2.1-12.el8.x86_64                                                  8/13 
  Installing       : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                    9/13 
  Running scriptlet: fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                    9/13 
  Installing       : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64                       10/13 
  Installing       : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64                    11/13 
  Installing       : docker-ce-rootless-extras-20.10.8-3.el8.x86_64                            12/13 
  Running scriptlet: docker-ce-rootless-extras-20.10.8-3.el8.x86_64                            12/13 
  Installing       : docker-ce-3:20.10.8-3.el8.x86_64                                          13/13 
  Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64                                          13/13 
  Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch          13/13 
  Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64                                          13/13 
  Verifying        : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           1/13 
  Verifying        : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                    2/13 
  Verifying        : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64                        3/13 
  Verifying        : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64                     4/13 
  Verifying        : fuse-common-3.2.1-12.el8.x86_64                                            5/13 
  Verifying        : fuse3-3.2.1-12.el8.x86_64                                                  6/13 
  Verifying        : fuse3-libs-3.2.1-12.el8.x86_64                                             7/13 
  Verifying        : libcgroup-0.41-19.el8.x86_64                                               8/13 
  Verifying        : containerd.io-1.4.9-3.1.el8.x86_64                                         9/13 
  Verifying        : docker-ce-3:20.10.8-3.el8.x86_64                                          10/13 
  Verifying        : docker-ce-cli-1:20.10.8-3.el8.x86_64                                      11/13 
  Verifying        : docker-ce-rootless-extras-20.10.8-3.el8.x86_64                            12/13 
  Verifying        : docker-scan-plugin-0.8.0-3.el8.x86_64                                     13/13 
Installed:
  container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch                                   
  containerd.io-1.4.9-3.1.el8.x86_64                                                                
  docker-ce-3:20.10.8-3.el8.x86_64                                                                  
  docker-ce-cli-1:20.10.8-3.el8.x86_64                                                 
  docker-ce-rootless-extras-20.10.8-3.el8.x86_64                                         
  docker-scan-plugin-0.8.0-3.el8.x86_64                                                         
  fuse-common-3.2.1-12.el8.x86_64                                                                
  fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                                
  fuse3-3.2.1-12.el8.x86_64                                                                
  fuse3-libs-3.2.1-12.el8.x86_64                                               
  libcgroup-0.41-19.el8.x86_64                                                          
  libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64                                      
  slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64                                    
Complete!

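This command installs Docker but does not start it; like any other service, the daemon has to be started before it runs.

Step 4 - Start Docker and verify the installation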

[root@Iceland /]# systemctl start docker
[root@Iceland /]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b8dfde127a29: Pull complete 
Digest: sha256:7d91b69e04a9029b99f3585aaaccae2baa80bcf318f4a5d2165a9898cd2dc0a1
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.  # the client connects to the daemon
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.  # the daemon pulls the image
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.  # a container is created from the image and run
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.  # the daemon streams the output to the terminal

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/
For more examples and ideas, visit:
 https://docs.docker.com/get-started/

The key information above is the four steps Docker goes through to run a container. At this point Docker is installed.

[root@Iceland /]# docker version
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:53:39 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

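Tip: Alibaba Cloud image accelerator (registry mirror)

Log in to Alibaba Cloud and go to Container Registry > Image Tools > Image Accelerator, then copy and run the four commands listed for CentOS.

[root@Iceland /]# sudo mkdir -p /etc/docker  # create the config directory
[root@Iceland /]# sudo tee /etc/docker/daemon.json <<-'EOF'  # write the mirror address to the config file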
> {
>   "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
> }
> EOF
{
  "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
}
[root@Iceland /]# sudo systemctl daemon-reload  # reload systemd unit files
[root@Iceland /]# sudo systemctl restart docker  # restart Docker so the mirror takes effect
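
Note that tee overwrites /etc/docker/daemon.json, which is fine on a fresh server but would discard any settings already in the file. As a hedged sketch of the alternative, the snippet below merges a mirror into an existing configuration. The helper name add_registry_mirror and the sample "log-driver" entry are assumptions made for illustration; only the mirror URL comes from the commands above.

```python
import json


def add_registry_mirror(config: dict, mirror: str) -> dict:
    """Return a copy of a daemon.json dict with `mirror` in registry-mirrors."""
    updated = dict(config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror not in mirrors:  # avoid duplicate entries on repeated runs
        mirrors.append(mirror)
    updated["registry-mirrors"] = mirrors
    return updated


# Hypothetical existing settings; on a fresh install the file does not exist.
existing = {"log-driver": "json-file"}
merged = add_registry_mirror(existing, "https://lisay8ar.mirror.aliyuncs.com")
print(json.dumps(merged, indent=2))
```

The printed JSON is what would be written back to daemon.json before restarting Docker.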

Docker Network Configuration

Understanding the docker0 bridge

[root@Iceland /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo  # loopback address
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
    inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0  # Alibaba Cloud private (intranet) address
       valid_lft 315352421sec preferred_lft 315352421sec
    inet6 fe80::216:3eff:fe29:ef40/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0  # docker0 address
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5dff:fee9:e1b7/64 scope link 
       valid_lft forever preferred_lft forever

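Every container on Docker's default network is bridged to docker0, which acts much like a router, so all of them communicate within one subnet. Note that each container gets its own veth pair, a pair of virtual interfaces connecting it to docker0; the two ends are created together and removed together. This is what keeps containers isolated from one another while still letting them reach each other, and the outside network, efficiently.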
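
The subnet relationship can be checked with Python's standard ipaddress module. This is only an illustrative sketch: the two interface addresses are copied from the ip addr output above, while 172.17.0.2 is an assumed example of an address that docker0 could hand to a container.

```python
import ipaddress

# Interface addresses copied from the `ip addr` output above.
host_eth0 = ipaddress.ip_interface("172.30.31.209/20")
docker0 = ipaddress.ip_interface("172.17.0.1/16")

# An interface address plus its prefix length determines its subnet.
print(host_eth0.network)  # 172.30.16.0/20
print(docker0.network)    # 172.17.0.0/16

# Assumed example of a container address on the default bridge.
container_ip = ipaddress.ip_address("172.17.0.2")
print(container_ip in docker0.network)              # True
print(docker0.network.overlaps(host_eth0.network))  # False
```

The container address falls inside docker0's subnet, and that subnet does not overlap the cloud host's private network, which is why the bridge can route between them without ambiguity.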

Test

[root@Iceland ~]# docker run -d -P --name tomcat01 tomcat  # -P publishes the container's exposed ports to random host ports
Unable to find image 'tomcat:latest' locally
latest: Pulling from library/tomcat
1cfaf5c6f756: Pull complete 
c4099a935a96: Pull complete 
f6e2960d8365: Pull complete 
dffd4e638592: Pull complete 
a60431b16af7: Pull complete 
4869c4e8de8d: Pull complete 
9815a275e5d0: Pull complete 
c36aa3d16702: Pull complete 
cc2e74b6c3db: Pull complete 
1827dd5c8bb0: Pull complete 
Digest: sha256:1af502b6fd35c1d4ab6f24dc9bd36b58678a068ff1206c25acc129fb90b2a76a
Status: Downloaded newer image for tomcat:latest
b530e79cc32b45ed6222496013b66ab663eaef74c83dc62610b252b18d1a3310
[root@Iceland ~]# docker exec -it tomcat01 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo  # loopback address
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0  # container address on the bridge; interfaces 6 and 7 form a veth pair
       valid_lft forever preferred_lft forever
[root@Iceland ~]# ping 172.17.0.2  # the host can ping the container directly by IP
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 172.17.0.2: icmp_seq=3 ttl=64 time=0.064 ms
^C
--- 172.17.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.064/0.078/0.101/0.016 ms
[root@Iceland ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
    inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
       valid_lft 315312613sec preferred_lft 315312613sec
    inet6 fe80::216:3eff:fe29:ef40/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5dff:fee9:e1b7/64 scope link 
       valid_lft forever preferred_lft forever
7: veth0a09b40@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default  # new since the earlier output: interface 7, the host end of the container's veth pair
    link/ether e6:88:6f:4a:e9:4c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::e488:6fff:fe4a:e94c/64 scope link 
       valid_lft forever preferred_lft forever

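Docker gives each container a pair of interfaces for talking to the bridge. This keeps containers isolated from one another while letting them communicate efficiently, and it lays the groundwork for communication inside the Slurm cluster deployed later.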
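
The pairing can be read straight from the interface header lines: each end shows its own index followed by @if and the index of its peer. The sketch below parses the two header lines from the output above; the names Iface, parse_iface, and is_veth_pair are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Iface(NamedTuple):
    index: int                 # this interface's own index
    name: str
    peer_index: Optional[int]  # index of the other end, if the line shows one


# Matches the first line of an `ip addr` entry, e.g. "6: eth0@if7: <...>".
_HEADER = re.compile(r"^(\d+):\s+([^:@\s]+)(?:@if(\d+))?:")


def parse_iface(line: str) -> Iface:
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not an interface header line: {line!r}")
    index, name, peer = match.groups()
    return Iface(int(index), name, int(peer) if peer else None)


def is_veth_pair(a: Iface, b: Iface) -> bool:
    """Two interfaces form a pair when each names the other's index."""
    return a.peer_index == b.index and b.peer_index == a.index


container_end = parse_iface("6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500")
host_end = parse_iface("7: veth0a09b40@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500")
print(is_veth_pair(container_end, host_end))  # True
```

Interfaces without a peer, such as lo or docker0 itself, parse with peer_index set to None.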

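Connecting containers with --link
Because a container's IP address can change, it is preferable to address a container by name instead of by IP; --link makes that possible.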

[root@Iceland ~]# docker run -d -P --name tomcat02 tomcat
07758a3a228c004fbf6cc8092b714d1249f921c4ba9360846206fc7915083f97
[root@Iceland ~]# docker ps
CONTAINER ID   IMAGE     COMMAND             CREATED          STATUS          PORTS                                         NAMES
07758a3a228c   tomcat    "catalina.sh run"   5 seconds ago    Up 4 seconds    0.0.0.0:49154->8080/tcp, :::49154->8080/tcp   tomcat02
b530e79cc32b   tomcat    "catalina.sh run"   50 minutes ago   Up 50 minutes   0.0.0.0:49153->8080/tcp, :::49153->8080/tcp   tomcat01
[root@Iceland ~]# docker exec -it tomcat02 ping tomcat01
ping: tomcat01: Name or service not known  # on the default bridge, one container cannot reach another by name
# Adding the --link option at run time solves this
[root@Iceland ~]# docker run -d -P --name tomcat03 --link tomcat02 tomcat
6e185946062f3af377eb58c34408471685cca20d8ca0b2873b24514856eda7d8
[root@Iceland ~]# docker exec -it tomcat03 ping tomcat02  # with 03 linked to 02, 03 can reach 02 by name
PING tomcat02 (172.17.0.3) 56(84) bytes of data.
64 bytes from tomcat02 (172.17.0.3): icmp_seq=1 ttl=64 time=0.131 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=2 ttl=64 time=0.091 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=3 ttl=64 time=0.076 ms
^C
--- tomcat02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 53ms
rtt min/avg/max/mdev = 0.076/0.099/0.131/0.024 ms
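# In the reverse direction, however, tomcat02 cannot ping tomcat03 by name: a link is one-way, so each direction needs its own configuration

Checking /etc/hosts inside tomcat03 shows that --link amounts to adding a one-way entry for tomcat02 to the hosts file.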

[root@Iceland ~]# docker exec -it tomcat03 cat /etc/hosts
127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters
172.17.0.3	tomcat02 07758a3a228c  # tomcat02 is bound here
172.17.0.4	6e185946062f
[root@Iceland ~]# docker exec -it tomcat02 cat /etc/hosts
127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters
172.17.0.3	07758a3a228c

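This also shows how inconvenient docker0 is. --link is a built-in mechanism with clear limits: it cannot be customized, and binding every direction by hand is not practical. In addition, docker0 does not let containers reach one another by name.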
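
The one-way nature of the mapping can be made concrete by treating each hosts file as a lookup table. The following sketch parses only the container-specific lines from the two files shown above; parse_hosts is a helper written for this illustration, not a Docker facility.

```python
def parse_hosts(text: str) -> dict:
    """Map every hostname or alias in /etc/hosts-style text to its IP."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *aliases = line.split()
        for alias in aliases:
            names.setdefault(alias, ip)  # the first entry for a name wins
    return names


# Container-specific lines from the two /etc/hosts files shown above.
tomcat03_hosts = parse_hosts("""
172.17.0.3  tomcat02 07758a3a228c
172.17.0.4  6e185946062f
""")
tomcat02_hosts = parse_hosts("""
172.17.0.3  07758a3a228c
""")

print(tomcat03_hosts.get("tomcat02"))  # 172.17.0.3
print(tomcat02_hosts.get("tomcat03"))  # None
```

tomcat03 can resolve tomcat02, but tomcat02 has no entry for tomcat03, which matches the failed ping in the reverse direction.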

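Advanced: Building a Custom Network

Network modes

  • bridge: bridged through docker0 (the default)
  • none: no networking configured
  • host: shares the host's network
  • container: shares another container's network (very limited)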
[root@Iceland ~]# docker network --help
Usage:  docker network COMMAND
Manage networks
Commands:
  connect     Connect a container to a network
  create      Create a network  # create builds a custom bridge network
  disconnect  Disconnect a container from a network
  inspect     Display detailed information on one or more networks
  ls          List networks
  prune       Remove all unused networks
  rm          Remove one or more networks
Run 'docker network COMMAND --help' for more information on a command.
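# docker0 is the default network; containers on it cannot be reached by name
[root@Iceland ~]# docker rm -f $(docker ps -aq)  # first remove the earlier containers; their veth interfaces go with them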
6e185946062f
07758a3a228c
b530e79cc32b
[root@Iceland ~]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
[root@Iceland ~]# docker images
REPOSITORY   TAG       IMAGE ID       CREATED       SIZE
tomcat       latest    266d1269bb29   10 days ago   668MB
[root@Iceland ~]# ip addr  # only the original three interfaces remain
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
    inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
       valid_lft 315310384sec preferred_lft 315310384sec
    inet6 fe80::216:3eff:fe29:ef40/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5dff:fee9:e1b7/64 scope link 
       valid_lft forever preferred_lft forever
# Create a new bridge network: --driver [network type] --subnet [subnet range] --gateway [gateway address]
[root@Iceland ~]# docker network create --driver bridge --subnet 192.168.0.0/16 --gateway 192.168.0.1 mynet
57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e
[root@Iceland ~]# docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
9223b334e60a   bridge    bridge    local
8d96801ccaf3   host      host      local
57c914464f0a   mynet     bridge    local
a5ff794b6d74   none      null      local
[root@Iceland ~]# docker network inspect mynet
[
    {
        "Name": "mynet",
        "Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
        "Created": "2021-08-29T09:07:38.248210817+08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.0.0/16",  # the network is configured as requested
                    "Gateway": "192.168.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {}
    }
]
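
Before creating a network it can help to sanity-check that the gateway actually lies inside the subnet. The sketch below validates the values and builds the same argument list as the command above without running it; the function name network_create_args is an assumption for this example, while the three flags are the ones used above.

```python
import ipaddress


def network_create_args(name, subnet, gateway, driver="bridge"):
    """Build the argument list for `docker network create` after basic checks."""
    net = ipaddress.ip_network(subnet)  # raises ValueError on a malformed subnet
    gw = ipaddress.ip_address(gateway)
    if gw not in net:
        raise ValueError(f"gateway {gw} is outside subnet {net}")
    if gw in (net.network_address, net.broadcast_address):
        raise ValueError(f"gateway {gw} is not a usable host address")
    return ["docker", "network", "create",
            "--driver", driver,
            "--subnet", str(net),
            "--gateway", str(gw),
            name]


print(" ".join(network_create_args("mynet", "192.168.0.0/16", "192.168.0.1")))
```

On a host with Docker installed, the returned list could be handed to subprocess.run; here it is only printed.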

Testing the custom network

[root@Iceland ~]# docker run -d -P --name tomcat-net-01 --net mynet tomcat
c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898
[root@Iceland ~]# docker run -d -P --name tomcat-net-02 --net mynet tomcat
91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb
[root@Iceland ~]# docker network inspect mynet
[
    {
        "Name": "mynet",
        "Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
        "Created": "2021-08-29T09:07:38.248210817+08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.0.0/16",
                    "Gateway": "192.168.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb": {
                "Name": "tomcat-net-02",
                "EndpointID": "4df2dc1c5314bb02ae69ef7b47e32e658cb3aaaf7c65074bfddfe38629ba65be",
                "MacAddress": "02:42:c0:a8:00:03",
                "IPv4Address": "192.168.0.3/16",  # allocated from the subnet we defined: 192.168.0.3
                "IPv6Address": ""
            },
            "c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898": {
                "Name": "tomcat-net-01",
                "EndpointID": "7d92fc552cb88f410b207075e473afde36f63020dc63f0de7923fd7137e19b1f",
                "MacAddress": "02:42:c0:a8:00:02",
                "IPv4Address": "192.168.0.2/16",  # allocated from the subnet we defined: 192.168.0.2
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]
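
The inspect output is plain JSON, so the name-to-address mapping can be extracted programmatically. The sketch below works on a trimmed copy of the output above (container IDs shortened, unused fields dropped); container_addresses is a helper name chosen for this example.

```python
import ipaddress
import json

# Trimmed copy of the `docker network inspect mynet` output above;
# container IDs are shortened and only the fields used below are kept.
inspect_output = """
[{"Name": "mynet",
  "IPAM": {"Config": [{"Subnet": "192.168.0.0/16", "Gateway": "192.168.0.1"}]},
  "Containers": {
    "91ce2929f008": {"Name": "tomcat-net-02", "IPv4Address": "192.168.0.3/16"},
    "c2e8c4d6ec1a": {"Name": "tomcat-net-01", "IPv4Address": "192.168.0.2/16"}}}]
"""


def container_addresses(inspect_json: str) -> dict:
    """Return {container name: IPv4 address} for the first network listed."""
    network = json.loads(inspect_json)[0]
    subnet = ipaddress.ip_network(network["IPAM"]["Config"][0]["Subnet"])
    addresses = {}
    for info in network["Containers"].values():
        ip = ipaddress.ip_interface(info["IPv4Address"]).ip
        if ip not in subnet:
            raise ValueError(f"{info['Name']} has {ip}, outside {subnet}")
        addresses[info["Name"]] = str(ip)
    return addresses


print(container_addresses(inspect_output))
```

Both containers receive addresses from the 192.168.0.0/16 subnet defined when the network was created.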

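The benefit of a custom bridge network is that different networks (different subnets) are isolated from one another, while containers inside the same network can all reach each other. The two containers can ping each other in both directions, which removes the limitation of --link.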

[root@Iceland ~]# docker exec -it tomcat-net-01 ping 192.168.0.3
PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=0.119 ms
64 bytes from 192.168.0.3: icmp_seq=2 ttl=64 time=0.092 ms
64 bytes from 192.168.0.3: icmp_seq=3 ttl=64 time=0.080 ms
^C
--- 192.168.0.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 64ms
rtt min/avg/max/mdev = 0.080/0.097/0.119/0.016 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.101 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.102 ms
^C
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 27ms
rtt min/avg/max/mdev = 0.101/0.106/0.116/0.010 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping tomcat-net-01  # reaching the container by name works too
PING tomcat-net-01 (192.168.0.2) 56(84) bytes of data.
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=3 ttl=64 time=0.086 ms
^C
--- tomcat-net-01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 51ms
rtt min/avg/max/mdev = 0.086/0.094/0.098/0.005 ms

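Docker Compose: Orchestrating Multiple Containers

Official documentation: https://docs.docker.com/compose/

The original single-container workflow (Dockerfile -> docker build -> docker run) means handling every container by hand, which does not scale to a large cluster. Docker Compose fills this gap by automating multiple containers from a single configuration file.

Official introduction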

Using Compose is basically a three-step process:

  1. Define your app’s environment with a Dockerfile so it can be reproduced anywhere.
  2. Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.
  3. Run docker compose up and the Docker compose command starts and runs your entire app. You can alternatively run docker-compose up using the docker-compose binary.

Two key Compose concepts:

  • Service: a container running one application
  • Project: a group of related containers

Step 1 - Install Compose

# First download the Compose binary from GitHub
[root@Iceland ~]# sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# Make the file executable
[root@Iceland ~]# sudo chmod +x /usr/local/bin/docker-compose
[root@Iceland ~]# docker-compose version  # confirms the installation succeeded
docker-compose version 1.29.2, build 5becea4c
docker-py version: 5.0.0
CPython version: 3.7.10
OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019
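
The curl command fills in the operating system and CPU architecture with uname -s and uname -m. As a rough sketch of how that URL is assembled, the snippet below uses Python's platform module, which reports the same values on Linux; compose_download_url is a name invented for this illustration.

```python
import platform


def compose_download_url(version: str = "1.29.2", system: str = "", machine: str = "") -> str:
    """Build the release URL that the curl command above downloads."""
    system = system or platform.system()     # like `uname -s`, e.g. "Linux"
    machine = machine or platform.machine()  # like `uname -m`, e.g. "x86_64"
    return ("https://github.com/docker/compose/releases/download/"
            f"{version}/docker-compose-{system}-{machine}")


print(compose_download_url(system="Linux", machine="x86_64"))
```

Called without arguments, the function uses the values of the machine it runs on, which mirrors what the shell substitution does on the server.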

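Step 2 - The official example

The official example is a Python application, a hit counter backed by Redis. The downloads kept failing on my slow server, so it is not demonstrated here. The rough steps are:

  • step 1: write the application, app.py
  • step 2: write a Dockerfile that packages the application into an image (a standalone app, not yet deployed)
  • step 3: write the docker-compose YAML file, the core file, which defines the whole service and the environment it needs in production
  • step 4: start the Compose project (docker-compose up runs the entire set of services)

Slurm Cluster Experiment

Reference blog: https://medium.com/analytics-vidhya/slurm-cluster-with-docker-9f242deee601

Step 1 - Slurm architecture

We will use docker-compose to create a Slurm cluster; it lets us build an environment from Docker images (built by the blog's author). docker-compose creates the containers and a network so that they can communicate in an isolated environment. Each container is one component of the cluster.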

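  • slurmmaster is the container running slurmctld (Slurm's central management daemon).
  • slurmnode[1-3] are the containers running slurmd (Slurm's compute node daemon).
  • slurmjupyter is the container running JupyterLab. It lets JupyterLab act as a cluster client: as end users, we interact with Slurm through JupyterLab in a browser.
  • cluster_default is the network that docker-compose creates to join and hold all the containers. Containers inside the network can see one another.
    The following diagram shows how all the components interact.

[Figure: interaction between the cluster components]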

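Step 2 - Write the YAML file

Because every service uses a prebuilt image, the whole project needs nothing but a YAML file, which names the images to pull and how to run them. Starting the cluster only takes docker-compose up -d on the command line.

# Create a directory named cluster to hold the file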
[root@Iceland ~]# mkdir cluster
[root@Iceland ~]# ls
cluster  composetest
[root@Iceland ~]# cd cluster
[root@Iceland cluster]# vim docker-compose.yml

The docker-compose.yml file is as follows:

services:
  slurmjupyter:		# start the slurmjupyter container
        image: rancavil/slurm-jupyter:19.05.5-1		# the rancavil image namespace is short for the author's name, Rodrigo Ancavil
        hostname: slurmjupyter
        user: admin
        volumes:
                - shared-vol:/home/admin
        ports:
                - 8888:8888
  slurmmaster:
        image: rancavil/slurm-master:19.05.5-1
        hostname: slurmmaster
        user: admin
        volumes:
                - shared-vol:/home/admin
        ports:
                - 6817:6817
                - 6818:6818
                - 6819:6819
  slurmnode1:		# settings for compute node container 1
        image: rancavil/slurm-node:19.05.5-1
        hostname: slurmnode1
        user: admin
        volumes:
                - shared-vol:/home/admin
        environment:
                - SLURM_NODENAME=slurmnode1
        links:
                - slurmmaster		# similar to the custom network earlier: lets node1 connect to master; the same applies to the nodes below
  slurmnode2:
        image: rancavil/slurm-node:19.05.5-1
        hostname: slurmnode2
        user: admin
        volumes:
                - shared-vol:/home/admin
        environment:
                - SLURM_NODENAME=slurmnode2
        links:
                - slurmmaster
  slurmnode3:
        image: rancavil/slurm-node:19.05.5-1
        hostname: slurmnode3
        user: admin
        volumes:
                - shared-vol:/home/admin
        environment:
                - SLURM_NODENAME=slurmnode3
        links:
                - slurmmaster
volumes:
        shared-vol:
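The three node services in this file differ only in their name. As an illustration of that pattern (not part of the original setup), the sketch below builds the same three definitions in Python and prints them as JSON for inspection; the function names `node_service` and `node_services` are made up for this example. Note that the prebuilt images presumably carry a Slurm configuration that knows only slurmnode[1-3], so generating more nodes this way is not guaranteed to work.

```python
import json

IMAGE_TAG = "19.05.5-1"  # tag used by every image in the compose file


def node_service(index):
    """Return the service definition for compute node `index` (1, 2, 3, ...)."""
    name = "slurmnode{}".format(index)
    return {
        "image": "rancavil/slurm-node:" + IMAGE_TAG,
        "hostname": name,
        "user": "admin",
        "volumes": ["shared-vol:/home/admin"],
        "environment": ["SLURM_NODENAME=" + name],
        "links": ["slurmmaster"],
    }


def node_services(count):
    """Build the services mapping for `count` compute nodes."""
    return {"slurmnode{}".format(i): node_service(i) for i in range(1, count + 1)}


if __name__ == "__main__":
    print(json.dumps(node_services(3), indent=2))
```

Comparing the printed structure with the YAML is a quick way to spot a copy-paste mistake such as a node whose SLURM_NODENAME does not match its hostname.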

Step 3 - Run docker-compose up

[root@Iceland cluster]# docker-compose up -d		# Start the deployment; the installation output follows
Creating network "cluster_default" with the default driver	# docker-compose automatically creates a custom network from the YAML
Creating volume "cluster_shared-vol" with default driver
Pulling slurmjupyter (rancavil/slurm-jupyter:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-jupyter
83ee3a23efb7: Pull complete
db98fc6f11f0: Pull complete
f611acd52c6c: Pull complete
87f6e2c4791b: Pull complete
1301353d4fa3: Pull complete
3347f4fbce33: Pull complete
0cf1a37339f3: Pull complete
e78d0881f8c1: Pull complete
37049fe9d876: Pull complete
a8fa566a7a57: Pull complete
24af49ba4a2f: Pull complete
97b9029f86ee: Pull complete
Digest: sha256:17a72e8e4c5d687359c2923af7166e84f9bd3b63146145421bbac006ce141d45
Status: Downloaded newer image for rancavil/slurm-jupyter:19.05.5-1
Pulling slurmmaster (rancavil/slurm-master:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-master
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
e216e1a311d3: Pull complete
ab998a26ee04: Pull complete
499f3426618c: Pull complete
b5b815649fa6: Pull complete
2f04debb872c: Pull complete
4050a9c6f8d3: Pull complete
Digest: sha256:1979f86166b58213380604dcd7c1fcdb2438a40c44add2ff356be47160a97ab3
Status: Downloaded newer image for rancavil/slurm-master:19.05.5-1
Pulling slurmnode1 (rancavil/slurm-node:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-node
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
d82ef016a552: Pull complete
5865a097296e: Pull complete
0602a8c59a76: Pull complete
6f2545f38103: Pull complete
608c665d03da: Pull complete
c80540692f3b: Pull complete
Digest: sha256:ae650d12fbdaddd29208d7638aa0498c655bfe5a33f4fd07d57e51eb211f18c2
Status: Downloaded newer image for rancavil/slurm-node:19.05.5-1
Creating cluster_slurmmaster_1  ... done
Creating cluster_slurmjupyter_1 ... done
Creating cluster_slurmnode1_1   ... done
Creating cluster_slurmnode2_1   ... done
Creating cluster_slurmnode3_1   ... done
[root@Iceland cluster]# docker-compose ps		# All 5 containers are up and running
         Name                       Command               State                      Ports                   
-------------------------------------------------------------------------------------------------------------
cluster_slurmjupyter_1   /etc/slurm-llnl/docker-ent ...   Up      0.0.0.0:8888->8888/tcp,:::8888->8888/tcp   
cluster_slurmmaster_1    /etc/slurm-llnl/docker-ent ...   Up      3306/tcp,                                  
                                                                  0.0.0.0:6817->6817/tcp,:::6817->6817/tcp,  
                                                                  0.0.0.0:6818->6818/tcp,:::6818->6818/tcp,  
                                                                  0.0.0.0:6819->6819/tcp,:::6819->6819/tcp   
cluster_slurmnode1_1     /etc/slurm-llnl/docker-ent ...   Up      6817/tcp, 6818/tcp, 6819/tcp               
cluster_slurmnode2_1     /etc/slurm-llnl/docker-ent ...   Up      6817/tcp, 6818/tcp, 6819/tcp               
cluster_slurmnode3_1     /etc/slurm-llnl/docker-ent ...   Up      6817/tcp, 6818/tcp, 6819/tcp 

Enter the server's public IP address followed by :8888 in a browser to see the running JupyterLab interface.

[Screenshot: JupyterLab interface]
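If the page does not load, it can help to confirm that port 8888 is reachable from your own machine before suspecting JupyterLab itself; a missing security group rule is one possible cause. The sketch below uses only the Python standard library. The function name `port_open` is made up for this example, and the host value is a placeholder that you must replace with your server's public IP.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    host = "127.0.0.1"  # placeholder: replace with the server's public IP
    print("port 8888 reachable:", port_open(host, 8888))
```

A result of False from outside the server combined with True when run on the server itself points at the security group or a firewall rather than at the container.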

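This is the preinstalled Slurm Queue extension.

[Screenshot: Slurm Queue extension]

Click this button to open the Slurm Queue management interface.

[Screenshot: Slurm Queue management interface]

Back on the previous page, click the Terminal button to open a shell inside the container and take a look.

[Screenshot: JupyterLab terminal]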

admin@slurmjupyter:~$ scontrol show node		# Show node information; all 3 nodes are present
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.5
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 
   RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar 
   BootTime=2021-08-28T11:15:59 SlurmdStartTime=2021-08-29T06:38:14
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode2 NodeHostName=slurmnode2 Version=19.05.5
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 
   RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar 
   BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode3 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode3 NodeHostName=slurmnode3 Version=19.05.5
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 
   RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar 
   BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
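The scontrol output is a sequence of key=value fields, with each record starting at NodeName. To check the state of every node at a glance, the text can be parsed with a short script. This is a minimal sketch: the name `parse_nodes` is made up for this example, the sample is abridged from the output above, and values containing spaces (such as OS=...) are deliberately cut at the first space.

```python
import re

# A field is a word, an equals sign, and everything up to the next whitespace.
FIELD = re.compile(r"(\w+)=(\S*)")

# Abridged from the scontrol output shown above.
SAMPLE = """\
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
"""


def parse_nodes(text):
    """Map each NodeName to a dict of the fields that follow it."""
    nodes = {}
    current = None
    for key, value in FIELD.findall(text):
        if key == "NodeName":
            # A NodeName field opens a new record.
            current = nodes.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return nodes


if __name__ == "__main__":
    for name, fields in parse_nodes(SAMPLE).items():
        print(name, fields["State"], fields["CPUTot"])
```

On the cluster itself, the sample string would be replaced by the real output of scontrol show node, for example captured with the subprocess module from the JupyterLab terminal.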

Step 4 - Run a Slurm example

First, create a new file in JupyterLab, rename it test.py, and enter the following code, which simply makes the program sleep for 15 seconds.

#!/usr/bin/env python3
  
import time
import os
import socket
from datetime import datetime as dt
if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    print('PID  : {}'.format(os.getpid()))
    print('Executing for 15 secs')
    time.sleep(15)
    print('Process finished {}\n'.format(dt.now()))

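[Screenshot: test.py in JupyterLab]

Next, create a script file job.sh that distributes the work across slurmnode[1-3]. It broadcasts test.py to the allocated nodes with sbcast, sets the number of tasks to 3, and writes the output to result.out.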

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=result.out
#
#SBATCH --ntasks=3
#
sbcast -f test.py /tmp/test.py
srun python3 /tmp/test.py

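[Screenshot: job.sh in JupyterLab]

Then open the Slurm Queue management interface and click Submit Job to submit the work for the cluster. Here we submit the job.sh file: choose the file type, set the path to /home/admin/job.sh, and click the submit button.

[Screenshot: Submit Job dialog]

Remember to click Reload so that the job is loaded into the queue view; the job is now running on the cluster. After about 15 seconds, an output file named result.out appears in the sidebar. Double-click it to see the results produced in parallel by the 3 compute nodes.

[Screenshot: contents of result.out]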
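To confirm that the tasks really ran on different nodes rather than reading result.out by eye, the NODE lines that test.py prints can be collected with a few lines of Python. This is a sketch under the assumption that result.out contains exactly the lines test.py prints; the function name `nodes_in_output` is made up for this example. With --ntasks=3 and one CPU per node, three distinct host names would be expected.

```python
import os
import re

# Matches the line produced by print('NODE : {}'.format(socket.gethostname())).
NODE_LINE = re.compile(r"^NODE : (\S+)$", re.MULTILINE)


def nodes_in_output(text):
    """Return the sorted, de-duplicated host names reported by test.py."""
    return sorted(set(NODE_LINE.findall(text)))


if __name__ == "__main__":
    path = "result.out"  # written by job.sh via --output=result.out
    if os.path.exists(path):
        with open(path) as handle:
            hosts = nodes_in_output(handle.read())
        print("{} distinct node(s): {}".format(len(hosts), ", ".join(hosts)))
    else:
        print("result.out not found; run the job first")
```

Run it from /home/admin in the JupyterLab terminal after the job has finished.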

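This completes the Slurm example. Because the free server has only 1 core and 2 GB of memory, I could not submit a more complex job such as matrix multiplication.

Finally, remember to stop the services.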

[root@Iceland cluster]# docker-compose stop
Stopping cluster_slurmnode1_1   ... done
Stopping cluster_slurmnode2_1   ... done
Stopping cluster_slurmnode3_1   ... done
Stopping cluster_slurmjupyter_1 ... done
Stopping cluster_slurmmaster_1  ... done
[root@Iceland cluster]# docker-compose ps
         Name                       Command                State     Ports
--------------------------------------------------------------------------
cluster_slurmjupyter_1   /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmmaster_1    /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmnode1_1     /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmnode2_1     /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmnode3_1     /etc/slurm-llnl/docker-ent ...   Exit 137 
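The Exit 137 state in this listing is not an application error code. By convention, an exit status above 128 means the process was terminated by the signal numbered (status - 128), and 137 - 128 = 9 is SIGKILL. A likely explanation, though not verified here, is that the containers did not exit within the stop grace period after receiving SIGTERM and were then killed. The helper name `describe_exit` below is made up for illustration.

```python
import signal


def describe_exit(status):
    """Explain an exit status using the 128 + signal-number convention."""
    if status > 128:
        number = status - 128
        try:
            return "killed by " + signal.Signals(number).name
        except ValueError:
            # The signal number is not defined on this platform.
            return "killed by unknown signal {}".format(number)
    return "exited on its own with status {}".format(status)


if __name__ == "__main__":
    for status in (0, 137, 143):
        print(status, "->", describe_exit(status))
```

Since the containers were being shut down anyway, this status is harmless here; it would matter more for services that need a clean shutdown to flush data.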

Special thanks to Kuangshen's Docker videos on Bilibili: "As long as studying won't kill you, study as hard as you possibly can."
