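Building a Shared Multi-User GPU Server with LXD
Environment:
GPU: GTX 1080 Ti (quantity: 1)
CPU: Intel i7-9700K (quantity: 1)
Motherboard: Gigabyte Z390-D
OS version: Ubuntu 20.04.2 LTS
First, a problem I ran into:
At first I wanted to try the installation on my old system (same OS version as above, in use for about half a year, with various environment variables and configuration already changed, CUDA not installed, and the Aliyun mirror as the package source). However, running either of the following install commands: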
sudo snap install lxd
# or
sudo apt install lxd
kept failing with the following error:
error: unable to contact snap store
In the end I had no choice but to reinstall the OS. After the reinstall, the same install command succeeded and LXD was installed.
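Before going as far as a reinstall, it may be worth checking whether the store endpoint can be reached at all. The following is only a minimal sketch: the `can_reach` helper and the `SNAP_STORE_URL` variable are hypothetical names, and the URL assumes the default store API host `api.snapcraft.io`.

```shell
#!/usr/bin/env bash
# Assumption: api.snapcraft.io is the store API host snapd contacts by default.
SNAP_STORE_URL="${SNAP_STORE_URL:-https://api.snapcraft.io/}"

# can_reach URL: succeed if curl can complete a request to URL within
# 10 seconds, whatever HTTP status code the server answers with.
can_reach() {
    curl --silent --show-error --max-time 10 --output /dev/null "$1"
}

# Usage:
#   can_reach "$SNAP_STORE_URL" && echo "store reachable" || echo "store unreachable"
```

If the probe fails on the old system but succeeds elsewhere on the same network, the problem is more likely local configuration (proxy, DNS, environment variables) than the store itself.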
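As for the cause: access to the snap store from within China has always been unreliable, and my personal guess is that using a domestic mirror was involved. Enough preamble; on to the main content.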
1. Install LXD and its components
# Install LXD
sudo snap install lxd
# Install ZFS and bridge-utils
sudo apt install zfsutils-linux bridge-utils
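LXD provides the containers, ZFS serves as LXD's storage backend, and bridge-utils is used to set up the network bridge. Because the LXD package available through apt is not the latest version (on Ubuntu 20.04 it is a transitional package that installs the snap anyway), LXD is installed with snap here.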
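After both commands finish, a quick check that the expected binaries are on the PATH can save confusion later. This is a small sketch; `missing_tools` is a hypothetical helper name, and it assumes `/snap/bin` is already on the PATH (a fresh login shell may be needed after installing the snap).

```shell
# missing_tools CMD...: print each named command that is not on PATH,
# one per line; prints nothing when everything is available.
missing_tools() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage (lxd and lxc come from the snap, zfs from zfsutils-linux,
# brctl from bridge-utils):
#   missing_tools lxd lxc zfs brctl
```

An empty result means all four tools were found.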
2. Initialize LXD
sudo lxd init
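During initialization, do not create a new bridge, make the ZFS pool as large as you can, and accept the defaults for everything else. The full dialog: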
Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]: lxd
Name of the storage backend to use (ceph, btrfs, dir, lvm, zfs) [default=zfs]:
Create a new ZFS pool? (yes/no) [default=yes]:
Would you like to use an existing empty block device (e.g. a disk or partition)? (yes/no) [default=no]:
Size in GB of the new loop device (1GB minimum) [default=30GB]: 800
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]: no
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]: yes
Name of the existing bridge or host interface: br0
Would you like the LXD server to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:
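The same answers can also be supplied non-interactively through `lxd init --preseed`, which is handy when the server has to be rebuilt. The sketch below is an approximation of the dialog above, not the exact YAML that LXD would print: `nic_type` and `gen_preseed` are hypothetical helper names, and the choice between `bridged` and `macvlan` assumes that a real bridge exposes a `bridge` directory under `/sys/class/net/<name>`.

```shell
# nic_type IFACE: print "bridged" if IFACE is an existing Linux bridge
# (it has a bridge/ directory in sysfs), otherwise "macvlan".
nic_type() {
    if [ -d "/sys/class/net/$1/bridge" ]; then
        echo bridged
    else
        echo macvlan
    fi
}

# gen_preseed POOL SIZE IFACE: print a preseed document that mirrors the
# interactive answers (loop-backed ZFS pool, no new bridge, default
# profile attached to an existing interface).
gen_preseed() {
    local pool="$1" size="$2" iface="$3"
    cat <<EOF
config: {}
networks: []
storage_pools:
- name: ${pool}
  driver: zfs
  config:
    size: ${size}
profiles:
- name: default
  devices:
    eth0:
      name: eth0
      type: nic
      nictype: $(nic_type "$iface")
      parent: ${iface}
    root:
      path: /
      pool: ${pool}
      type: disk
EOF
}

# Usage (review the output first, then pipe it into lxd):
#   gen_preseed lxd 800GB br0
#   gen_preseed lxd 800GB br0 | sudo lxd init --preseed
```

Printing the document before applying it makes it easy to confirm the pool size and parent interface match the intended setup.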
Note that in the line:
Name of the existing bridge or host interface: br0
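not every host has a bridge named br0. Since, as stated above, no new bridge should be created during initialization, the name of an existing network interface on the host can be used here temporarily instead.
Use the following command to find the interface names on the host: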
ifconfig
On my machine the output is:
wuladuizhang@wuladuizhang:~$ ifconfig
enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.120 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::1172:fdf3:78cf:18e6 prefixlen 64 scopeid 0x20<link>
ether 18:c0:4d:61:39:43 txqueuelen 1000 (Ethernet)
RX packets 56005 bytes 79580943 (79.5 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 14295 bytes 1389545 (1.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 407 bytes 36567 (36.5 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 407 bytes 36567 (36.5 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
That is, the first interface listed, enp4s0. Network configuration is covered in Section 4.
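On systems where `ifconfig` is not installed, the interface names can also be read straight from sysfs. The following is a rough sketch; `list_ifaces` is a hypothetical helper, and "nic" here simply means "not a bridge", so virtual interfaces will show up with that label too.

```shell
# list_ifaces [SYSFS_DIR]: print "<name> <kind>" for every interface
# except loopback; kind is "bridge" or "nic" (anything that is not a bridge).
list_ifaces() {
    local dir="${1:-/sys/class/net}" path name
    for path in "$dir"/*; do
        [ -d "$path" ] || continue        # skip plain files such as bonding_masters
        name="${path##*/}"
        [ "$name" = "lo" ] && continue    # loopback cannot serve as a parent
        if [ -d "$path/bridge" ]; then
            echo "$name bridge"
        else
            echo "$name nic"
        fi
    done
}

# Usage:
#   list_ifaces
```

Any name reported as a bridge can be entered at the "existing bridge or host interface" prompt; otherwise a physical interface such as enp4s0 is the fallback described above.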
3. Create a container
If your connection is fast enough, try:
sudo lxc launch ubuntu:xenial yourContainerName
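Once the launch command returns, the container's state can be confirmed from the CSV output of `lxc list`. A minimal sketch, where `container_state` is a hypothetical helper that only parses text and does not talk to LXD itself:

```shell
# container_state NAME: read "name,state" CSV rows on stdin and print the
# state of the row whose name is exactly NAME (nothing if it is absent).
container_state() {
    awk -F, -v name="$1" '$1 == name { print $2 }'
}

# Usage:
#   sudo lxc list --format csv -c ns | container_state yourContainerName
```

An empty result means no container with that name exists, which usually indicates that the image download did not complete.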
If the connection is too slow, add the Tsinghua University mirror:
sudo lxc remote add tuna-images https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
Note:
On Ubuntu 18.04 the command above adds the Tsinghua mirror successfully, but on Ubuntu 20.04 it fails with:
To start your first instance, try: lxc launch ubuntu:18.04
Description:
Add new remote servers
URL for remote resources must be HTTPS (https://).
Basic authentication can be used when combined with the "simplestreams" protocol:
lxc remote add some-name https://LOGIN:PASSWORD@exampl