A Docker-Based Slurm Job Management System
Alibaba Cloud Server Setup
Reference video: https://www.bilibili.com/video/BV177411K7bH
Step 1 - Apply for an Alibaba Cloud Server
You can apply for an Alibaba Cloud host free for one month. I applied for a one-month cloud server with 1 CPU core and 2 GB of RAM, 4 Mbps of bandwidth, and a 40 GB system disk, running 64-bit CentOS 8.4.
Step 2 - Configure the Instance
After entering the Elastic Compute Service (ECS) console,
click the running instance i-uf689okdsil887t0h11x. Note the server's public IP address; it will be needed for the SSH login later.
Next, change the instance hostname and reset the instance password, then restart the instance right away for the changes to take effect.
Step 3 - Open the Security Group and Map Ports
A cloud server bought on Alibaba Cloud needs its security group configured; otherwise it cannot be reached from outside.
Click Configure Rules in the Actions column to enter the security group, then add the ports you need to open. The final example uses port 8888, so be sure to open it.
Port 22 is open by default (used for SSH later). The ports I have added so far are shown in the figure above; you can always come back and add more if needed.
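The security group only governs Alibaba Cloud's own network layer. If a port is open there but connections still fail, it is worth checking the firewall inside the instance as well; a hedged sketch, assuming the CentOS 8 default firewalld (on many Alibaba Cloud images it is disabled, in which case nothing needs doing):
```shell
systemctl is-active firewalld                       # "inactive" means the host firewall is off
sudo firewall-cmd --list-ports                      # ports currently open in the default zone
sudo firewall-cmd --permanent --add-port=8888/tcp   # open 8888 on the host side too
sudo firewall-cmd --reload                          # apply the permanent rule
```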
Step 4 - Connect Remotely with Xshell
Download Xshell 7 from its official site and install it. Create a new session, fill in your own Alibaba Cloud public IP, then enter the username root and the password you just set for the server, and you are in. Seeing "Welcome to Alibaba Cloud Elastic Compute Service!" means you have logged in to the server.
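If you prefer a plain terminal over Xshell, any OpenSSH client gives the same login; a minimal sketch (the address below is a placeholder for your instance's public IP):
```shell
# Log in as root using the public IP shown in the ECS console
ssh root@<your-public-ip>
# Enter the password set in Step 2 when prompted
```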
# Following the login banner's prompt, activate the web console
[root@Iceland ~]# systemctl enable --now cockpit.socket
Created symlink /etc/systemd/system/sockets.target.wants/cockpit.socket → /usr/lib/systemd/system/cockpit.socket.
# Check the server's current environment
[root@Iceland ~]# pwd
/root
[root@Iceland ~]# cd /
[root@Iceland /]# ls
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
[root@Iceland /]# uname -r # check the OS kernel version
4.18.0-305.3.1.el8.x86_64
[root@Iceland /]# cat /etc/os-release # view detailed OS information
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
Installing Docker on the Server
Official documentation: https://docs.docker.com/engine/install/centos/
Step 1 - Remove Old Docker Versions
[root@Iceland /]# sudo yum remove docker \
> docker-client \
> docker-client-latest \
> docker-common \
> docker-latest \
> docker-latest-logrotate \
> docker-logrotate \
> docker-engine
No match for argument: docker
No match for argument: docker-client
No match for argument: docker-client-latest
No match for argument: docker-common
No match for argument: docker-latest
No match for argument: docker-latest-logrotate
No match for argument: docker-logrotate
No match for argument: docker-engine
No packages marked for removal.
Dependencies resolved.
Nothing to do.
Complete! # since this is a brand-new server, none of these old Docker packages were present
Step 2 - Set Up the Package Repository
[root@Iceland /]# yum install -y yum-utils # install yum-utils
Last metadata expiration check: 2:09:26 ago on Sat 28 Aug 2021 06:38:17 PM CST.
Dependencies resolved.
=============================================================================================
Package Architecture Version Repository Size
=============================================================================================
Installing:
yum-utils noarch 4.0.18-4.el8 baseos 71 k
Transaction Summary
=============================================================================================
Install 1 Package
Total download size: 71 k
Installed size: 22 k
Downloading Packages:
yum-utils-4.0.18-4.el8.noarch.rpm 1.7 MB/s | 71 kB 00:00
---------------------------------------------------------------------------------------------
Total 1.6 MB/s | 71 kB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : yum-utils-4.0.18-4.el8.noarch 1/1
Running scriptlet: yum-utils-4.0.18-4.el8.noarch 1/1
Verifying : yum-utils-4.0.18-4.el8.noarch 1/1
Installed:
yum-utils-4.0.18-4.el8.noarch
Complete!
# add the stable Docker repository
[root@Iceland /]# yum-config-manager \
> --add-repo \
> https://download.docker.com/linux/centos/docker-ce.repo
Adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
# The overseas repository site is slow; we will switch to Alibaba Cloud's mirror later
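Jumping ahead slightly: if download.docker.com is too slow, the repo file can instead point at Alibaba Cloud's public mirror of docker-ce. A hedged sketch (the mirror URL below is Aliyun's well-known docker-ce mirror; confirm it is still current before relying on it):
```shell
# Use Aliyun's mirror of the docker-ce repository instead of the overseas one
yum-config-manager \
    --add-repo \
    https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
```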
Step 3 - Install the Docker Engine
[root@Iceland /]# yum install docker-ce docker-ce-cli containerd.io # install these 3 components
Docker CE Stable - x86_64 18 kB/s | 15 kB 00:00
Dependencies resolved.
=====================================================================================================
Package Arch Version Repository Size
=====================================================================================================
Installing:
containerd.io x86_64 1.4.9-3.1.el8 docker-ce-stable 30 M
docker-ce x86_64 3:20.10.8-3.el8 docker-ce-stable 22 M
docker-ce-cli x86_64 1:20.10.8-3.el8 docker-ce-stable 29 M
Installing dependencies:
container-selinux noarch 2:2.164.1-1.module_el8.4.0+886+c9a8d9ad appstream 52 k
docker-ce-rootless-extras x86_64 20.10.8-3.el8 docker-ce-stable 4.6 M
docker-scan-plugin x86_64 0.8.0-3.el8 docker-ce-stable 4.2 M
fuse-common x86_64 3.2.1-12.el8 baseos 21 k
fuse-overlayfs x86_64 1.6-1.module_el8.4.0+886+c9a8d9ad appstream 73 k
fuse3 x86_64 3.2.1-12.el8 baseos 50 k
fuse3-libs x86_64 3.2.1-12.el8 baseos 94 k
libcgroup x86_64 0.41-19.el8 baseos 70 k
libslirp x86_64 4.3.1-1.module_el8.4.0+575+63b40ad7 appstream 69 k
slirp4netns x86_64 1.1.8-1.module_el8.4.0+641+6116a774 appstream 51 k
Enabling module streams:
container-tools rhel8
Transaction Summary
=====================================================================================================
Install 13 Packages
Total download size: 90 M
Installed size: 377 M
Is this ok [y/N]: y # just type y when prompted
Downloading Packages:
(1/13): container-selinux-2.164.1-1.module_el8.4.0+886+c9a8d9ad.noar 1.4 MB/s | 52 kB 00:00
(2/13): fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64.rpm 1.9 MB/s | 73 kB 00:00
(3/13): libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64.rpm 1.3 MB/s | 69 kB 00:00
(4/13): fuse-common-3.2.1-12.el8.x86_64.rpm 1.4 MB/s | 21 kB 00:00
(5/13): slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64.rpm 2.7 MB/s | 51 kB 00:00
(6/13): fuse3-3.2.1-12.el8.x86_64.rpm 4.0 MB/s | 50 kB 00:00
(7/13): libcgroup-0.41-19.el8.x86_64.rpm 4.6 MB/s | 70 kB 00:00
(8/13): fuse3-libs-3.2.1-12.el8.x86_64.rpm 4.7 MB/s | 94 kB 00:00
(9/13): docker-ce-20.10.8-3.el8.x86_64.rpm 5.5 MB/s | 22 MB 00:03
(10/13): docker-ce-rootless-extras-20.10.8-3.el8.x86_64.rpm 3.5 MB/s | 4.6 MB 00:01
(11/13): containerd.io-1.4.9-3.1.el8.x86_64.rpm 4.7 MB/s | 30 MB 00:06
(12/13): docker-scan-plugin-0.8.0-3.el8.x86_64.rpm 3.5 MB/s | 4.2 MB 00:01
(13/13): docker-ce-cli-20.10.8-3.el8.x86_64.rpm 3.6 MB/s | 29 MB 00:08
-----------------------------------------------------------------------------------------------------
Total 11 MB/s | 90 MB 00:08
warning: /var/cache/dnf/docker-ce-stable-fa9dc42ab4cec2f4/packages/containerd.io-1.4.9-3.1.el8.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Docker CE Stable - x86_64 3.1 kB/s | 1.6 kB 00:00
Importing GPG key 0x621E9F35:
Userid : "Docker Release (CE rpm) <docker@docker.com>"
Fingerprint: 060A 61C5 1B55 8A7F 742B 77AA C52F EB6B 621E 9F35
From : https://download.docker.com/linux/centos/gpg
Is this ok [y/N]: y # just type y when prompted
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : docker-scan-plugin-0.8.0-3.el8.x86_64 1/13
Running scriptlet: docker-scan-plugin-0.8.0-3.el8.x86_64 1/13
Installing : docker-ce-cli-1:20.10.8-3.el8.x86_64 2/13
Running scriptlet: docker-ce-cli-1:20.10.8-3.el8.x86_64 2/13
Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 3/13
Installing : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 3/13
Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 3/13
Installing : containerd.io-1.4.9-3.1.el8.x86_64 4/13
Running scriptlet: containerd.io-1.4.9-3.1.el8.x86_64 4/13
Running scriptlet: libcgroup-0.41-19.el8.x86_64 5/13
Installing : libcgroup-0.41-19.el8.x86_64 5/13
Running scriptlet: libcgroup-0.41-19.el8.x86_64 5/13
Installing : fuse3-libs-3.2.1-12.el8.x86_64 6/13
Running scriptlet: fuse3-libs-3.2.1-12.el8.x86_64 6/13
Installing : fuse-common-3.2.1-12.el8.x86_64 7/13
Installing : fuse3-3.2.1-12.el8.x86_64 8/13
Installing : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64 9/13
Running scriptlet: fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64 9/13
Installing : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64 10/13
Installing : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64 11/13
Installing : docker-ce-rootless-extras-20.10.8-3.el8.x86_64 12/13
Running scriptlet: docker-ce-rootless-extras-20.10.8-3.el8.x86_64 12/13
Installing : docker-ce-3:20.10.8-3.el8.x86_64 13/13
Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64 13/13
Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 13/13
Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64 13/13
Verifying : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 1/13
Verifying : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64 2/13
Verifying : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64 3/13
Verifying : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64 4/13
Verifying : fuse-common-3.2.1-12.el8.x86_64 5/13
Verifying : fuse3-3.2.1-12.el8.x86_64 6/13
Verifying : fuse3-libs-3.2.1-12.el8.x86_64 7/13
Verifying : libcgroup-0.41-19.el8.x86_64 8/13
Verifying : containerd.io-1.4.9-3.1.el8.x86_64 9/13
Verifying : docker-ce-3:20.10.8-3.el8.x86_64 10/13
Verifying : docker-ce-cli-1:20.10.8-3.el8.x86_64 11/13
Verifying : docker-ce-rootless-extras-20.10.8-3.el8.x86_64 12/13
Verifying : docker-scan-plugin-0.8.0-3.el8.x86_64 13/13
Installed:
container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch
containerd.io-1.4.9-3.1.el8.x86_64
docker-ce-3:20.10.8-3.el8.x86_64
docker-ce-cli-1:20.10.8-3.el8.x86_64
docker-ce-rootless-extras-20.10.8-3.el8.x86_64
docker-scan-plugin-0.8.0-3.el8.x86_64
fuse-common-3.2.1-12.el8.x86_64
fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64
fuse3-3.2.1-12.el8.x86_64
fuse3-libs-3.2.1-12.el8.x86_64
libcgroup-0.41-19.el8.x86_64
libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64
slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64
Complete!
Although this command installed Docker, the daemon has not been started yet (like any service, it must be started before it runs).
Step 4 - Start Docker and Verify
[root@Iceland /]# systemctl start docker
[root@Iceland /]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b8dfde127a29: Pull complete
Digest: sha256:7d91b69e04a9029b99f3585aaaccae2baa80bcf318f4a5d2165a9898cd2dc0a1
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon. # the client contacts the daemon
2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64) # pull the image
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading. # create a container from the image and run it
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal. # the daemon streams the output via the client to your terminal
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
The most important part of this output is the explanation of the 4 steps Docker performed; with this, the Docker installation is complete.
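On a server that may be rebooted, it also helps to register Docker to start at boot; a small extra step:
```shell
# Start docker now and enable it at boot in one command
sudo systemctl enable --now docker
systemctl is-active docker    # prints "active" once the daemon is running
```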
[root@Iceland /]# docker version
Client: Docker Engine - Community
Version: 20.10.8
API version: 1.41
Go version: go1.16.6
Git commit: 3967b7d
Built: Fri Jul 30 19:53:39 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.8
API version: 1.41 (minimum version 1.12)
Go version: go1.16.6
Git commit: 75249d8
Built: Fri Jul 30 19:52:00 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.9
GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
runc:
Version: 1.0.1
GitCommit: v1.0.1-0-g4144b63
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Tip: Alibaba Cloud image accelerator
Log in to Alibaba Cloud → Container Registry → Image Tools → Image Accelerator, then copy and run the 4 commands listed for CentOS.
[root@Iceland /]# sudo mkdir -p /etc/docker # create the directory
[root@Iceland /]# sudo tee /etc/docker/daemon.json <<-'EOF' # write the registry mirror config file
> {
> "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
> }
> EOF
{
"registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
}
[root@Iceland /]# sudo systemctl daemon-reload # reload systemd unit files
[root@Iceland /]# sudo systemctl restart docker # restart docker
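To confirm the accelerator actually took effect, docker info lists the configured mirrors; a quick check:
```shell
# The Aliyun accelerator address should appear under "Registry Mirrors"
docker info | grep -A 1 "Registry Mirrors"
```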
Docker Network Configuration
Understanding the docker0 Bridge
[root@Iceland /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo # local loopback address
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0 # Alibaba Cloud private network address
valid_lft 315352421sec preferred_lft 315352421sec
inet6 fe80::216:3eff:fe29:ef40/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0 # docker0 address
valid_lft forever preferred_lft forever
inet6 fe80::42:5dff:fee9:e1b7/64 scope link
valid_lft forever preferred_lft forever
Every container in Docker communicates on one subnet by bridging through docker0, which acts like a router. Notably, each container sets up a veth pair (a pair of virtual interfaces) with docker0; the two ends appear together and disappear together. This is what keeps containers independent of one another yet able to interconnect efficiently, and to talk to the outside network.
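A quick way to see, from the host, which veth ends are currently plugged into docker0 (complementing the per-container view in the test below); interface names will differ on your machine:
```shell
# List only the interfaces enslaved to the docker0 bridge
ip link show master docker0
# Each vethXXXX@ifN entry is the host end of one container's veth pair
```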
Test
[root@Iceland ~]# docker run -d -P --name tomcat01 tomcat # -P maps the container's ports to random host ports; create and run the container
Unable to find image 'tomcat:latest' locally
latest: Pulling from library/tomcat
1cfaf5c6f756: Pull complete
c4099a935a96: Pull complete
f6e2960d8365: Pull complete
dffd4e638592: Pull complete
a60431b16af7: Pull complete
4869c4e8de8d: Pull complete
9815a275e5d0: Pull complete
c36aa3d16702: Pull complete
cc2e74b6c3db: Pull complete
1827dd5c8bb0: Pull complete
Digest: sha256:1af502b6fd35c1d4ab6f24dc9bd36b58678a068ff1206c25acc129fb90b2a76a
Status: Downloaded newer image for tomcat:latest
b530e79cc32b45ed6222496013b66ab663eaef74c83dc62610b252b18d1a3310
[root@Iceland ~]# docker exec -it tomcat01 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo # local loopback address
valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0 # the bridged veth address; interfaces 6 and 7 form a pair
valid_lft forever preferred_lft forever
[root@Iceland ~]# ping 172.17.0.2 # the container can be pinged by its address straight from the Linux command line
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 172.17.0.2: icmp_seq=3 ttl=64 time=0.064 ms
^C
--- 172.17.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.064/0.078/0.101/0.016 ms
[root@Iceland ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
valid_lft 315312613sec preferred_lft 315312613sec
inet6 fe80::216:3eff:fe29:ef40/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:5dff:fee9:e1b7/64 scope link
valid_lft forever preferred_lft forever
7: veth0a09b40@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default # compared with before, this extra interface, number 7, is the host end of the veth pair for the container we just created
link/ether e6:88:6f:4a:e9:4c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::e488:6fff:fe4a:e94c/64 scope link
valid_lft forever preferred_lft forever
Docker assigns each container a pair of interfaces for communicating with the bridge. This technique gives containers mutual isolation together with efficient communication, and it lays the groundwork for the inter-node communication of the Slurm cluster we deploy later.
Container-to-Container Communication with --link
Because a container's IP address may change, we want --link, which lets a container be addressed by its container name instead of its IP.
[root@Iceland ~]# docker run -d -P --name tomcat02 tomcat
07758a3a228c004fbf6cc8092b714d1249f921c4ba9360846206fc7915083f97
[root@Iceland ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
07758a3a228c tomcat "catalina.sh run" 5 seconds ago Up 4 seconds 0.0.0.0:49154->8080/tcp, :::49154->8080/tcp tomcat02
b530e79cc32b tomcat "catalina.sh run" 50 minutes ago Up 50 minutes 0.0.0.0:49153->8080/tcp, :::49153->8080/tcp tomcat01
[root@Iceland ~]# docker exec -it tomcat02 ping tomcat01
ping: tomcat01: Name or service not known # one container cannot reach another directly by name
# adding the --link option at run time solves this
[root@Iceland ~]# docker run -d -P --name tomcat03 --link tomcat02 tomcat
6e185946062f3af377eb58c34408471685cca20d8ca0b2873b24514856eda7d8
[root@Iceland ~]# docker exec -it tomcat03 ping tomcat02 # tomcat03 was linked to tomcat02, and now they can communicate
PING tomcat02 (172.17.0.3) 56(84) bytes of data.
64 bytes from tomcat02 (172.17.0.3): icmp_seq=1 ttl=64 time=0.131 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=2 ttl=64 time=0.091 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=3 ttl=64 time=0.076 ms
^C
--- tomcat02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 53ms
rtt min/avg/max/mdev = 0.076/0.099/0.131/0.024 ms
# but the reverse fails: tomcat02 cannot ping tomcat03, because the link must be configured in each direction
Inspecting tomcat03's bridge information shows that --link simply adds a one-way mapping for tomcat02 to the container's hosts configuration.
[root@Iceland ~]# docker exec -it tomcat03 cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3 tomcat02 07758a3a228c # here tomcat02 is bound
172.17.0.4 6e185946062f
[root@Iceland ~]# docker exec -it tomcat02 cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3 07758a3a228c
This also shows the inconvenience of docker0: --link is an officially defined feature with real limitations, it cannot be customized, and binding every direction for every pair of containers is impractical. Moreover, the default docker0 network does not support access by container name.
Going Further: Building a Custom Network
Network modes
- bridge: bridged mode via docker0 (the default)
- none: no network configured
- host: share the network with the host
- container: share another container's network (very limited)
[root@Iceland ~]# docker network --help
Usage: docker network COMMAND
Manage networks
Commands:
connect Connect a container to a network
create Create a network # use create to build a custom bridge network
disconnect Disconnect a container from a network
inspect Display detailed information on one or more networks
ls List networks
prune Remove all unused networks
rm Remove one or more networks
Run 'docker network COMMAND --help' for more information on a command.
# on the default docker0 network, containers cannot be reached by name
[root@Iceland ~]# docker rm -f $(docker ps -aq) # first remove the previous containers along with their network attachments
6e185946062f
07758a3a228c
b530e79cc32b
[root@Iceland ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@Iceland ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
tomcat latest 266d1269bb29 10 days ago 668MB
[root@Iceland ~]# ip addr # only the original 3 network interfaces are left
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
valid_lft 315310384sec preferred_lft 315310384sec
inet6 fe80::216:3eff:fe29:ef40/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:5dff:fee9:e1b7/64 scope link
valid_lft forever preferred_lft forever
# create a new bridge network: --driver [network type] --subnet [subnet range] --gateway [gateway address]
[root@Iceland ~]# docker network create --driver bridge --subnet 192.168.0.0/16 --gateway 192.168.0.1 mynet
57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e
[root@Iceland ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
9223b334e60a bridge bridge local
8d96801ccaf3 host host local
57c914464f0a mynet bridge local
a5ff794b6d74 none null local
[root@Iceland ~]# docker network inspect mynet
[
    {
        "Name": "mynet",
        "Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
        "Created": "2021-08-29T09:07:38.248210817+08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.0.0/16",   # the network is set up as we defined
                    "Gateway": "192.168.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {}
    }
]
Testing the Custom Network
[root@Iceland ~]# docker run -d -P --name tomcat-net-01 --net mynet tomcat
c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898
[root@Iceland ~]# docker run -d -P --name tomcat-net-02 --net mynet tomcat
91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb
[root@Iceland ~]# docker network inspect mynet
[
    {
        "Name": "mynet",
        "Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
        "Created": "2021-08-29T09:07:38.248210817+08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.0.0/16",
                    "Gateway": "192.168.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb": {
                "Name": "tomcat-net-02",
                "EndpointID": "4df2dc1c5314bb02ae69ef7b47e32e658cb3aaaf7c65074bfddfe38629ba65be",
                "MacAddress": "02:42:c0:a8:00:03",
                "IPv4Address": "192.168.0.3/16",   # the IP here, 192.168.0.3, comes from the subnet we defined
                "IPv6Address": ""
            },
            "c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898": {
                "Name": "tomcat-net-01",
                "EndpointID": "7d92fc552cb88f410b207075e473afde36f63020dc63f0de7923fd7137e19b1f",
                "MacAddress": "02:42:c0:a8:00:02",
                "IPv4Address": "192.168.0.2/16",   # the IP here, 192.168.0.2, comes from the subnet we defined
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]
The advantage of a custom bridge network is that different networks (different subnets) are isolated from each other, while everything inside one network is fully interconnected. The two containers can ping each other, fixing the shortcomings of --link.
[root@Iceland ~]# docker exec -it tomcat-net-01 ping 192.168.0.3
PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=0.119 ms
64 bytes from 192.168.0.3: icmp_seq=2 ttl=64 time=0.092 ms
64 bytes from 192.168.0.3: icmp_seq=3 ttl=64 time=0.080 ms
^C
--- 192.168.0.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 64ms
rtt min/avg/max/mdev = 0.080/0.097/0.119/0.016 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.101 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.102 ms
^C
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 27ms
rtt min/avg/max/mdev = 0.101/0.106/0.116/0.010 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping tomcat-net-01 # pinging directly by container name also works
PING tomcat-net-01 (192.168.0.2) 56(84) bytes of data.
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=3 ttl=64 time=0.086 ms
^C
--- tomcat-net-01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 51ms
rtt min/avg/max/mdev = 0.086/0.094/0.098/0.005 ms
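A related trick worth knowing: a container started on the default docker0 bridge is not stranded, because docker network connect can attach it to mynet afterwards, giving it name resolution there too. A short sketch (the container name tomcat-default is hypothetical):
```shell
# Start a container on the default bridge (no --net given)
docker run -d -P --name tomcat-default tomcat
# Attach it to the custom network as a second interface
docker network connect mynet tomcat-default
# It can now reach the mynet containers by name
docker exec -it tomcat-default ping -c 3 tomcat-net-01
```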
Docker Compose: Batch Container Orchestration
Official documentation: https://docs.docker.com/compose/
In the original single-container Docker workflow (Dockerfile → docker build → docker run), every container is operated by hand, which falls short when facing large clusters. Hence Docker Compose, which automates the handling of multiple containers through one configuration file.
Official introduction
Using Compose is basically a three-step process:
- Define your app's environment with a Dockerfile so it can be reproduced anywhere.
- Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.
- Run docker compose up and the Docker compose command starts and runs your entire app. You can alternatively run docker-compose up using the docker-compose binary.
Two key Compose concepts:
- service: one container / application
- project: a group of associated containers
Step 1 - Install Compose
# first download the Compose binary from GitHub
[root@Iceland ~]# sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# make the file executable
[root@Iceland ~]# sudo chmod +x /usr/local/bin/docker-compose
[root@Iceland ~]# docker-compose version # the version output confirms the installation succeeded
docker-compose version 1.29.2, build 5becea4c
docker-py version: 5.0.0
CPython version: 3.7.10
OpenSSL version: OpenSSL 1.1.0l 10 Sep 2019
Step 2 - The Official Example
The official example is a Python counter app that uses Redis. Because the server is too slow and the downloads kept failing, I will not demonstrate it here; the rough steps are as follows (common Compose commands are sketched after this list):
- step 1: write the application, app.py
- step 2: package the application into an image with a Dockerfile (a standalone app, not yet online)
- step 3: write the docker-compose YAML file defining the whole service and the environment it needs online (the core file)
- step 4: start the Compose project (docker-compose up brings up the whole stack)
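For reference, these are the everyday Compose commands used around such a project; run them in the directory containing docker-compose.yml:
```shell
docker-compose up -d      # create the network, volumes and containers, detached
docker-compose ps         # list this project's containers and their state
docker-compose logs -f    # follow the combined logs of all services
docker-compose stop       # stop the containers but keep them
docker-compose down       # stop and remove the containers and the network
```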
Slurm Cluster Setup Experiment
Reference blog: https://medium.com/analytics-vidhya/slurm-cluster-with-docker-9f242deee601
Step 1 - Slurm Architecture Overview
We will use docker-compose to create a Slurm cluster, which lets us build the environment from Docker images (built by the blog's author). Docker-compose creates the containers and a network so they can communicate in an isolated environment. Each container is one component of the cluster.
- slurmmaster is the container running slurmctld (Slurm's central management daemon).
- slurmnode[1-3] are the containers running slurmd (Slurm's compute node daemon).
- slurmjupyter is the container running JupyterLab, which serves as the cluster client. As end users we will interact with Slurm through JupyterLab in the browser.
- cluster_default is the network docker-compose creates to join and hold all the containers. Containers inside the network can see one another.
The following diagram shows how all the components interact.
Step 2 - Write the YAML File
Since everything uses prebuilt images, the whole project needs only the YAML file, which defines how the images are pulled and wired together; to run it, just type docker-compose up -d on the command line.
# create a folder named cluster to hold the files
[root@Iceland ~]# mkdir cluster
[root@Iceland ~]# ls
cluster composetest
[root@Iceland ~]# cd cluster
[root@Iceland cluster]# vim docker-compose.yml
The docker-compose.yml file is as follows:
services:
  slurmjupyter:                             # the slurmjupyter container
    image: rancavil/slurm-jupyter:19.05.5-1 # rancavil, the image namespace, is the author's name, Rodrigo Ancavil
    hostname: slurmjupyter
    user: admin
    volumes:
      - shared-vol:/home/admin
    ports:
      - 8888:8888
  slurmmaster:
    image: rancavil/slurm-master:19.05.5-1
    hostname: slurmmaster
    user: admin
    volumes:
      - shared-vol:/home/admin
    ports:
      - 6817:6817
      - 6818:6818
      - 6819:6819
  slurmnode1:                               # parameters for compute node container 1
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode1
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode1
    links:
      - slurmmaster                         # as with the custom network earlier, this lets node1 reach the master by name; same below
  slurmnode2:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode2
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode2
    links:
      - slurmmaster
  slurmnode3:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode3
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode3
    links:
      - slurmmaster
volumes:
  shared-vol:
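Indentation mistakes are the most common YAML failure, so before bringing the stack up it can help to let Compose parse the file back; a quick check:
```shell
# Parse docker-compose.yml and print the resolved configuration;
# any syntax or indentation error is reported instead
docker-compose config
```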
Step 3 - Run docker-compose up
[root@Iceland cluster]# docker-compose up -d # start the deployment; the setup output follows
Creating network "cluster_default" with the default driver # docker-compose automatically creates the custom network from the YAML
Creating volume "cluster_shared-vol" with default driver
Pulling slurmjupyter (rancavil/slurm-jupyter:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-jupyter
83ee3a23efb7: Pull complete
db98fc6f11f0: Pull complete
f611acd52c6c: Pull complete
87f6e2c4791b: Pull complete
1301353d4fa3: Pull complete
3347f4fbce33: Pull complete
0cf1a37339f3: Pull complete
e78d0881f8c1: Pull complete
37049fe9d876: Pull complete
a8fa566a7a57: Pull complete
24af49ba4a2f: Pull complete
97b9029f86ee: Pull complete
Digest: sha256:17a72e8e4c5d687359c2923af7166e84f9bd3b63146145421bbac006ce141d45
Status: Downloaded newer image for rancavil/slurm-jupyter:19.05.5-1
Pulling slurmmaster (rancavil/slurm-master:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-master
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
e216e1a311d3: Pull complete
ab998a26ee04: Pull complete
499f3426618c: Pull complete
b5b815649fa6: Pull complete
2f04debb872c: Pull complete
4050a9c6f8d3: Pull complete
Digest: sha256:1979f86166b58213380604dcd7c1fcdb2438a40c44add2ff356be47160a97ab3
Status: Downloaded newer image for rancavil/slurm-master:19.05.5-1
Pulling slurmnode1 (rancavil/slurm-node:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-node
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
d82ef016a552: Pull complete
5865a097296e: Pull complete
0602a8c59a76: Pull complete
6f2545f38103: Pull complete
608c665d03da: Pull complete
c80540692f3b: Pull complete
Digest: sha256:ae650d12fbdaddd29208d7638aa0498c655bfe5a33f4fd07d57e51eb211f18c2
Status: Downloaded newer image for rancavil/slurm-node:19.05.5-1
Creating cluster_slurmmaster_1 ... done
Creating cluster_slurmjupyter_1 ... done
Creating cluster_slurmnode1_1 ... done
Creating cluster_slurmnode2_1 ... done
Creating cluster_slurmnode3_1 ... done
[root@Iceland cluster]# docker-compose ps # all 5 containers are running normally
Name Command State Ports
-------------------------------------------------------------------------------------------------------------
cluster_slurmjupyter_1 /etc/slurm-llnl/docker-ent ... Up 0.0.0.0:8888->8888/tcp,:::8888->8888/tcp
cluster_slurmmaster_1 /etc/slurm-llnl/docker-ent ... Up 3306/tcp,
0.0.0.0:6817->6817/tcp,:::6817->6817/tcp,
0.0.0.0:6818->6818/tcp,:::6818->6818/tcp,
0.0.0.0:6819->6819/tcp,:::6819->6819/tcp
cluster_slurmnode1_1 /etc/slurm-llnl/docker-ent ... Up 6817/tcp, 6818/tcp, 6819/tcp
cluster_slurmnode2_1 /etc/slurm-llnl/docker-ent ... Up 6817/tcp, 6818/tcp, 6819/tcp
cluster_slurmnode3_1 /etc/slurm-llnl/docker-ent ... Up 6817/tcp, 6818/tcp, 6819/tcp
Enter the server's public IP followed by :8888 in a browser and you will see the running JupyterLab interface.
This is the preinstalled Slurm queue extension.
Click this button to enter the Slurm Queue management interface.
On the previous page, click the terminal button to look around inside.
admin@slurmjupyter:~$ scontrol show node # show node information; all 3 nodes are present
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.5
OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurmpar
BootTime=2021-08-28T11:15:59 SlurmdStartTime=2021-08-29T06:38:14
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurmnode2 NodeHostName=slurmnode2 Version=19.05.5
OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurmpar
BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode3 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurmnode3 NodeHostName=slurmnode3 Version=19.05.5
OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurmpar
BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
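Besides scontrol show node, Slurm ships two shorter views that are handy from the same terminal:
```shell
sinfo          # one-line summary of partitions and node states
squeue         # jobs currently queued or running
sinfo -N -l    # long per-node listing, compact compared with scontrol
```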
Step 4 - Run a Slurm Example
First create a new file in JupyterLab, rename it test.py, and enter the following code, which simply makes the process sleep for 15 seconds:
#!/usr/bin/env python3
import time
import os
import socket
from datetime import datetime as dt
if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    print('PID : {}'.format(os.getpid()))
    print('Executing for 15 secs')
    time.sleep(15)
    print('Process finished {}\n'.format(dt.now()))
Next create a script file job.sh to distribute the work across all of slurmnode[1-3]: it broadcasts test.py, sets the number of tasks to 3, and writes the output to the file result.out.
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=result.out
#
#SBATCH --ntasks=3
#
sbcast -f test.py /tmp/test.py
srun python3 /tmp/test.py
Then open the Slurm Queue management interface and click Submit Job to hand the work to the cluster. Here we submit the job.sh file: choose File as the type, set the path to /home/admin/job.sh, and click to submit the job.
Remember to click Reload once more to load the job into the system; the job then runs on the cluster. After about 15 seconds, a result.out output file appears in the sidebar; double-clicking it shows the results computed in parallel by the 3 compute nodes.
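The same submission also works without the GUI; from the JupyterLab terminal, sbatch takes the script directly. A hedged sketch, assuming you run it from /home/admin so that result.out lands there:
```shell
sbatch job.sh     # prints "Submitted batch job <id>"
squeue            # watch the job while it runs (~15 s)
cat result.out    # inspect the combined output afterwards
```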
This completes the Slurm example. Limited by the free 1-core / 2 GB server, more complex jobs such as matrix multiplication could not be submitted.
Finally, remember to shut the services down:
[root@Iceland cluster]# docker-compose stop
Stopping cluster_slurmnode1_1 ... done
Stopping cluster_slurmnode2_1 ... done
Stopping cluster_slurmnode3_1 ... done
Stopping cluster_slurmjupyter_1 ... done
Stopping cluster_slurmmaster_1 ... done
[root@Iceland cluster]# docker-compose ps
Name Command State Ports
--------------------------------------------------------------------------
cluster_slurmjupyter_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmmaster_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmnode1_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmnode2_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmnode3_1 /etc/slurm-llnl/docker-ent ... Exit 137
Special thanks to 狂神's Docker videos on Bilibili: "As long as learning doesn't kill you, learn like crazy."