Python development using GPU Docker with SSH on Ubuntu 22.04

"Docker: Introduction and Practice" (《Docker 技术入门与实战》), online edition

Install apt packages without root privileges

[JuNest] https://github.com/fsquillace/junest
[pkget] https://github.com/0x00009b/pkget

Install Python packages without root privileges

pip install -e .   # install the project in the current directory (setup.py) in editable mode
You can use the --target (-t) flag of pip install to specify the installation directory.
pip install -r requirements.txt -t .
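
Packages installed with -t are importable only if the target directory is on Python's import path (the current directory, PYTHONPATH, or sys.path). The following is a minimal sketch of adding such a directory at run time; the module name demo_pkg_in_target is a hypothetical stand-in for a real installed package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_target_to_path(target_dir):
    """Put a `pip install -t <target_dir>` directory at the front of sys.path."""
    target = str(Path(target_dir).resolve())
    if target not in sys.path:
        sys.path.insert(0, target)
    return target


# Demonstration with a stand-in module instead of a real pip install.
# "demo_pkg_in_target" is a hypothetical name used only for this example.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_pkg_in_target.py").write_text("VALUE = 42\n")
    add_target_to_path(tmp)
    importlib.invalidate_caches()
    module = importlib.import_module("demo_pkg_in_target")
    print(module.VALUE)
```

Setting PYTHONPATH to the target directory in ~/.bashrc has the same effect without touching the code.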

Install CUDA SDK & CUDNN

Install CUDA SDK without GPU driver

Install the CUDA SDK without its bundled GPU driver, because that driver may leave Ubuntu unable to boot into the GUI. The problem arises from a conflict between the open-source GPU driver and the driver included in the CUDA SDK.

Install CUDA + CUDNN without root privileges

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pip-wheels
(1) Install Previous CUDA and CUDNN release with conda

conda install cuda -c nvidia/label/cuda-11.7.0

(2) Install Previous CUDA and CUDNN release with pip

python3 -m pip install --upgrade setuptools pip wheel
python3 -m pip install nvidia-cuda-runtime-cu11
pip install nvidia-cublas-cu11 nvidia-cufft-cu11 nvidia-curand-cu11 nvidia-cusolver-cu11 nvidia-cusparse-cu11

Add the directory of CUDA & CUDNN into ~/.bashrc

export CUDAPATH=/home/chengzi/.local/lib/python3.9/site-packages
export LD_LIBRARY_PATH=$CUDAPATH/nvidia/cudnn/lib/:$CUDAPATH/nvidia/cuda_runtime/lib/:$CUDAPATH/nvidia/cufft/lib/:$CUDAPATH/nvidia/cusparse/lib/:$CUDAPATH/nvidia/cublas/lib/:$LD_LIBRARY_PATH
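
The list of library directories above has to be kept in sync with the installed wheels by hand. As a hedged alternative, the sketch below discovers every nvidia/*/lib directory under a given site-packages path and joins them into an LD_LIBRARY_PATH value. It assumes the pip wheels keep the nvidia/<component>/lib layout shown above; the function names are illustrative only.

```python
import tempfile
from pathlib import Path


def nvidia_lib_dirs(site_packages):
    """Return every existing <site_packages>/nvidia/*/lib directory, sorted."""
    root = Path(site_packages) / "nvidia"
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.glob("*/lib") if p.is_dir())


def ld_library_path(site_packages, existing=""):
    """Join the discovered directories with an existing LD_LIBRARY_PATH value."""
    parts = nvidia_lib_dirs(site_packages)
    if existing:
        parts.append(existing)
    return ":".join(parts)


# Demonstration on a throwaway directory tree that mimics the wheel layout.
with tempfile.TemporaryDirectory() as tmp:
    for component in ("cublas", "cudnn"):
        (Path(tmp) / "nvidia" / component / "lib").mkdir(parents=True)
    demo_dirs = nvidia_lib_dirs(tmp)
    demo_value = ld_library_path(tmp, existing="/usr/lib")

print(demo_value)
```

For a real installation, pass the user site-packages directory (for example the value returned by site.getusersitepackages()) and the current LD_LIBRARY_PATH.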

Install and configure GPU Docker v23.0+

Install Docker Engine on Ubuntu

sudo apt remove docker docker-engine docker.io containerd runc  #uninstall the old version
sudo apt autoremove
sudo apt update
sudo apt install ca-certificates curl gnupg
#sudo apt install apt-transport-https ca-certificates curl gnupg-agent software-properties-common 
# sudo add-apt-repository "deb [arch=amd64] https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
# Add Docker’s official GPG key:
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
#curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
#sudo apt-key fingerprint 0EBFCD88
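# Add the Docker apt repository (required so that apt can find the docker-ce packages):
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update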
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
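docker info  # show basic information about your Docker configuration
# Register the NVIDIA runtime with Docker (requires the NVIDIA Container Toolkit, which provides nvidia-ctk)
sudo nvidia-ctk runtime configure --runtime=docker
sudo service docker restart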

Configure the Docker daemon
To configure the Docker daemon with a JSON file, create /etc/docker/daemon.json (on Linux systems) with content such as:

{
  "builder": {
    "gc": {
      "defaultKeepStorage": "8GB",
      "enabled": true
    }
  },
  "features": {
    "buildkit": true
  }
}

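Because nvidia-ctk runtime configure also writes to /etc/docker/daemon.json, overwriting the file by hand can drop the NVIDIA runtime entry. The sketch below merges new settings into an existing configuration instead. The content of the "existing" file is an assumption made for illustration, and merge_config is a hypothetical helper, not part of Docker.

```python
import json


def merge_config(base, updates):
    """Recursively merge `updates` into a copy of `base` (dicts only)."""
    merged = dict(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Assumed existing daemon.json that already registers the NVIDIA runtime.
existing = json.loads(
    '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "args": []}}}'
)
wanted = {
    "builder": {"gc": {"defaultKeepStorage": "8GB", "enabled": True}},
    "features": {"buildkit": True},
}
merged = merge_config(existing, wanted)
print(json.dumps(merged, indent=2))
```

The printed JSON can then be written back to /etc/docker/daemon.json with root privileges, followed by a restart of the Docker service.
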
Fixing “successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero” problem
https://gist.github.com/zrruziev/b93e1292bf2ee39284f834ec7397ee9f

Usage of Docker

Running Docker as a non-privileged user

sudo gpasswd -a $USER docker     # add the current login user to the docker group
newgrp docker     # refresh group membership in the current shell
docker ps    # check that docker commands now work without sudo

Download container images

# Pull an existing image from Docker Hub
# https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=11.8
docker pull nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
# Include the --gpus flag when you start a container to access GPU resources, and specify how many GPUs to use.
# Ensure that nvidia-container-runtime-hook is accessible from $PATH.
which nvidia-container-runtime-hook
# Check that the --gpus option is available
docker run --help | grep -i gpus
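
The same availability check can be scripted. This is a minimal sketch using only the standard library; the list of tool names mirrors the commands used in this post, and missing_tools is a hypothetical helper.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


required = ["docker", "nvidia-ctk", "nvidia-container-runtime-hook"]
missing = missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools are on PATH.")
```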

Create GPU container with SSH connect support

[Connecting to a Docker container over SSH] https://blog.csdn.net/winter2121/article/details/118223637
[TIPS]

  • Since Docker 19.03 there is no need to install the separate nvidia-docker wrapper: GPU support is integrated into Docker (the NVIDIA Container Toolkit must still be installed on the host) and is controlled with the --gpus option.
  • Fix nvidia-smi/nvtop PIDs when used from within a container
    In version 1.5 and later of Docker, you can make the host’s process ID namespace visible from inside a container by specifying the --pid=host option to docker run.

(1) Create and Start a container to run CUDA computing with SSH

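# The proper way to start a GPU container: use detached mode (-d) to create and start it as a background container; a port must be published to the host
docker run --init -itd --gpus all --name [container_name] -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all -p [portid] [image_name_or_ID]
# The extra part is NVIDIA_DRIVER_CAPABILITIES=compute,utility
# If this environment variable is left unchanged, the host's NVIDIA driver is exposed to the container only as a utility; adding compute makes the host driver provide compute (CUDA) support to the container.
# Use all GPUs (623b42ee7f52 is an example image ID)
docker run --gpus all --name cu116u4 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -dit -p 8022:22 623b42ee7f52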
# Use specific GPUs, and limit the CPU and memory usage of the container
docker run --init --cpus=8 --memory=64G --gpus '"device=0,1"' --name cu117u2204 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -dit -p 8022:22 623b42ee7f52
# On the server (the host, not inside the container), check which host port is mapped to port 22 of the new container:
docker port cu117u2204 22   # docker port [your_container_name] 22
# If the configuration took effect, the output is:  0.0.0.0:8022
# Enter the container with docker exec and run nvidia-smi again; the output now matches that of the host.
# Run the PyTorch test code again; it prints True. You now have a container with NVIDIA driver and CUDA support.
# [****] Specify an init process (--init) to avoid zombie processes
# The -p option maps port 22 of the container to port 8022 of the host. If the host IP is 1.1.1.1, connecting over SSH to port 8022 of 1.1.1.1 reaches the container environment.
# In version 1.5 and later of Docker, you can make the host's process ID namespace visible (including for nvidia-smi) from inside a container by specifying the --pid=host option to docker run.
docker run --init --pid=host --gpus '"device=0,1"' --name cu113u4 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -dit -p 8022:22 623b42ee7f52
# Mount a local directory with -v or --mount (https://docs.docker.com/storage/bind-mounts/); add one of these options to docker run:
#   --mount type=bind,source=/home/rock,target="$(pwd)"/tmp
#   -v /raid/data/rock:"$(pwd)"/tmp    # -v <source on the host PC>:<target in the container>
# [Recommended *****]
docker run --init --cpus=10 --memory=128G --gpus '"device=5,6"' --name cu117u2204 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -v /raid/data/****:/home  -dit -p 8022:22 9c9baf5d5194
# Publish multiple container ports to the host PC
docker run --init --name web2204 -v /mnt/8TB-HDD/l***/web:/home -dit -p 8021:22 -p 3305:3306 -p 8027:8087 -p 8029:8009 -p 3465:465 -p 3380:80 -p 3379:6379 1f6ddc1b2547
# If you hit "RuntimeError: DataLoader worker (pid 25795) is killed by signal: Bus error.", enlarge the shared memory with --shm-size
docker run --init --cpus=10 --memory=128G --shm-size=32G --gpus '"device=5,6"' --name cu117u2204 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -v /raid/data/****:/home  -dit -p 8022:22 9c9baf5d5194
# ERR1 [enabling --privileged is not recommended]: "umount: loop3/: must be superuser to umount" inside Docker. A container started with a default docker run is not a full operating system and has no permission to access devices, so --privileged is required for such operations.
# ERR2 [SOLUTION: use the service command instead of systemctl]: "System has not been booted with systemd as init system (PID 1). Can't operate. Failed to connect to bus: Host is down."
docker run --init --pid=host --privileged --gpus '"device=0,1"' --name cu113u4 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -dit -p 8022:22 623b42ee7f52
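
The docker run variants above differ only in a few options. The sketch below assembles the argument list from parameters, which also makes the quoting of the --gpus value explicit: the value itself contains literal double quotes, and the shell needs single quotes around it. docker_run_command is a hypothetical helper written for this post, and it only builds the command; it does not run Docker.

```python
import shlex


def docker_run_command(image, name, gpus="all", host_ssh_port=8022,
                       cpus=None, memory=None, shm_size=None, volumes=()):
    """Build a `docker run` argument list mirroring the commands above."""
    cmd = ["docker", "run", "--init", "-dit", "--name", name]
    if gpus == "all":
        cmd += ["--gpus", "all"]
    else:
        # The commands above pass the device list with literal double quotes.
        devices = ",".join(str(g) for g in gpus)
        cmd += ["--gpus", '"device={}"'.format(devices)]
    cmd += ["-e", "NVIDIA_DRIVER_CAPABILITIES=compute,utility"]
    if cpus is not None:
        cmd.append(f"--cpus={cpus}")
    if memory is not None:
        cmd.append(f"--memory={memory}")
    if shm_size is not None:
        cmd.append(f"--shm-size={shm_size}")
    for host_dir, container_dir in volumes:
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    # Publish the container's SSH port (22) on the chosen host port.
    cmd += ["-p", f"{host_ssh_port}:22", image]
    return cmd


cmd = docker_run_command("623b42ee7f52", "cu117u2204", gpus=[0, 1],
                         cpus=8, memory="64G", shm_size="32G")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, in which case no shell quoting is involved at all.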

(2) Changing shmem size of a docker container
RuntimeError: DataLoader worker (pid 25795) is killed by signal: Bus error.
Docker limits shm (shared memory) in a container by default, while PyTorch uses shm to pass data between DataLoader workers. When several workers run in parallel, a DataLoader worker that exceeds the limit is killed.
Solution: when creating the Docker container, add the --shm-size option to change the default shared memory size.
Docker containers are allocated 64M of shared memory by default. The option --shm-size sets the required size of /dev/shm within the container.
Running a BusyBox container with default settings:

~ $ docker run -it busybox sh
/ # df -h /dev/shm
Filesystem                Size      Used Available Use% Mounted on
shm                      64.0M         0     64.0M   0% /dev/shm

Launching a new container with increased shmem size.

~ $ docker run --shm-size=256m -it busybox sh
/ # df -h /dev/shm
Filesystem                Size      Used Available Use% Mounted on
shm                     256.0M         0    256.0M   0% /dev/shm
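
The same check can be done from Python before starting a training job. This is a minimal sketch: parse_size assumes that sizes such as 64m or 32G are binary multiples (consistent with the df output above), and both function names are hypothetical.

```python
import os
import shutil

_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '64m' or '32G' to bytes (binary multiples)."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # a plain number is taken as bytes


def shm_is_large_enough(required, path="/dev/shm"):
    """Compare the total size of the filesystem at `path` with `required`."""
    return shutil.disk_usage(path).total >= parse_size(required)


if os.path.isdir("/dev/shm"):
    print("at least 256m of shared memory:", shm_is_large_enough("256m"))
```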

(3) Install and configure ssh tools in the container

docker exec -it bff419356d91 /bin/bash   # on the host PC: enter the running GPU container (bff419356d91 is the container ID)
# Run the following commands inside the GPU container
apt update
apt install net-tools vim openssh-server python3-pip rsync screen locales dialog sudo
sudo locale-gen en_US.UTF-8
# Inside the container, edit /etc/ssh/sshd_config and add the line "PermitRootLogin yes" so that SSH allows root login.
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
# Alternatively, run vim /etc/ssh/sshd_config and type the line PermitRootLogin yes by hand
# [Optional] Keep idle connections alive (restart sshd afterwards)
echo "ClientAliveInterval 60" | sudo tee -a /etc/ssh/sshd_config
echo "ClientAliveCountMax 6" | sudo tee -a /etc/ssh/sshd_config
# To fix "X11 forwarding request failed on channel 0"
echo "X11Forwarding yes" | sudo tee -a /etc/ssh/sshd_config
echo "AddressFamily inet" | sudo tee -a /etc/ssh/sshd_config
# (alternative, NOT the suggested solution: X11UseLocalhost no)
# Restart the SSH server for the new configuration to take effect:
sudo /etc/init.d/ssh force-reload
# sudo /etc/init.d/ssh restart  [OR]  sudo service ssh restart
service ssh restart  # the ssh service must also be restarted after the container restarts before SSH connections work
passwd root  # inside the container, set the root password used for the login in the next step
# The local machine, or any computer that can reach the host, can now access the container over SSH using the published port
ssh -X root@[Host IP] -p 8022
env  # check the value of the DISPLAY variable in the current session
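
One caveat with appending lines through echo: sshd uses the first value it finds for each keyword, so an appended line has no effect if the file already sets that keyword earlier, and repeated runs pile up duplicates. The sketch below rewrites the configuration text so that a keyword is set exactly once. set_sshd_option is a hypothetical helper, and it assumes a simple file without Match blocks.

```python
def set_sshd_option(config_text, key, value):
    """Return config text in which `key` is set exactly once to `value`."""
    new_line = f"{key} {value}"
    out, replaced = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        is_same_key = (
            bool(stripped)
            and not stripped.startswith("#")
            and stripped.split()[0].lower() == key.lower()
        )
        if is_same_key:
            if not replaced:
                out.append(new_line)  # replace the first occurrence in place
                replaced = True
            # later duplicates of the same keyword are dropped
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)  # keyword was absent: append it
    return "\n".join(out) + "\n"


sample = "#PermitRootLogin prohibit-password\nPermitRootLogin no\nX11Forwarding no\n"
result = set_sshd_option(sample, "PermitRootLogin", "yes")
print(result, end="")
```

Applying the function to the content of /etc/ssh/sshd_config and writing the result back (as root) is idempotent, so the step can be repeated safely whenever the container is rebuilt.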

(3) Launch, stop, and restart a container

# Enter a running container (run commands interactively in an existing container)
docker exec -it <container_ID_or_name> /bin/bash
docker start -i <container-name>    # start an existing container
docker stop <container-name>    # stop a running container
docker restart <container-name>    # restart a container

(4) Enable SSH connection on Host PC

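nvidia-smi -L  # list the GPUs
nvidia-smi -q  # show detailed GPU information

netstat -ntlp   # list all listening TCP ports
# Error message: ssh: connect to host localhost port 22: Connection refused
# Possible causes: 1. sshd is not installed  2. sshd is not running  3. the firewall blocks the port  4. the ssh service needs a restart
sudo apt install openssh-server   # make sure sshd is installed
sudo service ssh start   # start sshd
sudo ufw status   # check the firewall settings
sudo ufw disable   # turn the firewall off if it blocks the connection
ps -e | grep ssh   # check whether an sshd process is running
sudo service ssh restart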
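
To tell "Connection refused" apart from other failures without an SSH client, a plain TCP connection test is enough. The following sketch uses only the standard library; port_open is a hypothetical helper, and the demonstration runs against a temporary local listener rather than a real sshd.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demonstration against a temporary local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable_while_listening = port_open("127.0.0.1", demo_port)
listener.close()
reachable_after_close = port_open("127.0.0.1", demo_port)
print(reachable_while_listening, reachable_after_close)
```

Against the setup in this post, port_open("1.1.1.1", 8022) would test the published SSH port of the container from any machine that can reach the host.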

(5) Modify the ports of a running container without deleting it

  • Use docker inspect [containerID] to get details about the current port mapping.
docker port [containerID]      # check the container's published ports
docker inspect [containerID]   # check the configuration of this container
  • Stop the container before editing the files below: docker stop [containerID]
  • Change the port mapping by updating the PortBindings entry in the container's hostconfig.json file, found at /var/lib/docker/containers/[hash_of_the_container]/hostconfig.json
    Within the PortBindings section, either edit the existing HostPort to the port you would like, or add new entries yourself (see below)
# In the docker inspect output, the active mapping appears under “NetworkSettings”, and “PortBindings” appears under “HostConfig”.
"PortBindings": {
    "443/tcp": [
        {
            "HostIp": "",
            "HostPort": "2443"
        }
    ],
    "80/tcp": [
        {
            "HostIp": "",
            "HostPort": "8022"
        }
    ]
}
  • Edit the config.v2.json file by updating the ExposedPorts and Ports sections as shown below
$ vi /var/lib/docker/containers/[containerID]/config.v2.json
...
{
"Config": {
....
"ExposedPorts": {
"80/tcp": {},
"8888/tcp": {}
},
....
},
"NetworkSettings": {
....
"Ports": {
 "80/tcp": [
 {
 "HostIp": "",
 "HostPort": "80"
 }
 ],
 "8888/tcp": [
 {
 "HostIp": "",
 "HostPort": "8888"
 }
 ]
 },
....
}
  • Restart docker engine (to flush/clear config caches): service docker restart
  • Start up the container: docker start [containerID]
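
Editing the JSON by hand is error-prone, so the PortBindings change can also be scripted. The sketch below works on an in-memory dictionary shaped like the PortBindings fragment shown above; set_port_binding is a hypothetical helper, and reading and writing the real hostconfig.json (as root, with the container stopped) is left out.

```python
import copy
import json


def set_port_binding(hostconfig, container_port, host_port, proto="tcp", host_ip=""):
    """Return a copy of `hostconfig` with one PortBindings entry set."""
    updated = copy.deepcopy(hostconfig)
    bindings = updated.get("PortBindings") or {}
    bindings[f"{container_port}/{proto}"] = [
        {"HostIp": host_ip, "HostPort": str(host_port)}
    ]
    updated["PortBindings"] = bindings
    return updated


hostconfig = {"PortBindings": {"22/tcp": [{"HostIp": "", "HostPort": "8022"}]}}
updated = set_port_binding(hostconfig, 8888, 8888)
print(json.dumps(updated, indent=4))
```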

(6) Install and launch the service of MariaDB

sudo apt install -y mariadb-server mariadb-client
mariadb --version
sudo service mariadb status 
sudo service mariadb start  # Start And Enable MariaDB
sudo service mariadb restart

(7) Changing the Timezone of an Existing Docker Container
To change the timezone in an already running Docker container, you need to perform a few additional steps. First, get into the Docker container’s shell using the docker exec command.
Once you’re in the container’s shell, install the tzdata package. tzdata is a time zone and daylight-saving time database used by several systems (like UNIX systems).
apt update && apt install -y tzdata
After installing tzdata, you can reconfigure it to set the timezone.
dpkg-reconfigure tzdata
This command opens a simple text-based menu where you can select the geographical area and then the city to set the timezone.

Add a new user with sudo in Docker container

# Log into the system as root or with an account that has sudo privileges.
apt install sudo   # fixes `sudo: command not found`
adduser newuser   # add a new user (replace newuser with the desired name)
# To grant the new user elevated privileges, add them to the sudo group.
usermod -aG sudo newuser
# [Optional] usermod -aG root newuser
groups newuser   # verify that the user belongs to the sudo group
su - newuser    # switch to the newly added user

Upload files to a Docker container

  • Copy a file to a running Docker container
    Copy the file missing_data.sql into the root directory (/) of the running Docker container.
docker cp missing_data.sql <container-id>:/missing_data.sql
  • Copy multiple files from a folder on the host machine to the container
docker cp src/directory/. <container-id>:/target/directory/

Manage docker images or containers

(1) Check out the running containers & images

docker ps -a  # list all containers, running and stopped (docker ps alone lists only the running ones)
docker image ls  # List installed images

Duplicate a running docker container

To clone/duplicate a container and its data into a new one, use docker commit to take a snapshot of the container and save it as an image. Then use docker images to view the saved image.

# create a new image from that container using the docker commit command
docker commit bff419356d91 nvidia/cuda:11.6-pyscf-ubuntu20.04
# then start a new container from the newly created image (saved snapshot)
docker run -it <IMAGE ID> /bin/bash

Copy a Running Docker Container to Another Host

Migrate a container from one host to another by saving it as an image, copying the image archive, and loading it on the new host.

# docker commit -p=false NAME_OF_INSTANCE mycontainerimage
docker commit NAME_OF_INSTANCE mycontainerimage
# Save the docker image into an archive:
docker save image_name > image_name.tar
docker save -o /home/sammy/your_image.tar your_image_name
# save this image to a file and compress it
docker save mycontainerimage | gzip > mycontainerimage.tar.gz
# copy to another host using rsync or scp
scp -rP 8022 /root/pyscf21 root@147.8.234.17:/root
#Once you have your .tar file copied over to your new server, SSH to the new server and load the Docker image:
gunzip -c mycontainerimage.tar.gz | docker load
sudo docker load -i your_image.tar
#Then, in order to check if this was successful, you can run docker images to see the list of the available images:
sudo docker images
docker run -d --name=PICK_NAME_FOR_CONTAINER mycontainerimage
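
Image archives are often several gigabytes, so it can be worth confirming that the copy arrived intact before loading it. The sketch below computes a SHA-256 checksum in chunks; running it on both hosts and comparing the two values is an optional extra step, not part of the original procedure. The file name image.tar in the demonstration is a stand-in.

```python
import hashlib
import os
import tempfile


def sha256sum(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a tiny stand-in file instead of a real image archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "image.tar")
    with open(demo_path, "wb") as handle:
        handle.write(b"abc")
    demo_digest = sha256sum(demo_path)

print(demo_digest)
```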

Remove docker images or containers

  • image: a read-only template, comparable to a virtual machine image
  • container: the runtime state of an image; Docker keeps one container for every image that has been run
  • An image is referenced by the containers created from it, so an image cannot be removed until its containers have been removed.
  • Delete containers
docker rm -f <container_ID_or_name>   # remove a container, even if it is running
docker rm $(docker ps -aq)   # remove all stopped containers to free disk space (running ones are refused)
  • Delete docker images
docker rmi -f <image_ID_or_name>   
docker rmi $(docker image ls -aq)   # remove all images that are not in use
docker rmi -f $(docker images -a -q)   # force-remove all images
docker image prune  # remove dangling (untagged) images
  • Delete all docker containers or images
docker rm -f $(docker ps -aq)   # remove all containers, including running ones
docker image prune -a  # remove all images not used by any container
  • Clean up space used by Docker
docker system prune -a  # [Not recommended!!!] removes all stopped containers and all images without at least one container associated to them
docker volume prune   # remove unused volumes

Connect to a remote Tensorboard server

Run Tensorboard on localhost

pip install soundfile future tensorboard
# If running tensorboard --logdir=/tmp --bind_all fails with "tensorboard: command not found":
pip3 show tensorboard
Name: tensorboard
Version: 2.10.1
Location: /usr/local/lib/python3.8/dist-packages
# Go to the tensorboard directory under Location, where you will find main.py, and run it directly:
python3 main.py --logdir=/tmp --port 6006
# Alternatively, define a command alias
vim ~/.bashrc
# and add this line to the file:
alias tensorboard='python path-to-tensorboard/main.py'
#alias tensorboard='python /usr/local/lib/python3.8/dist-packages/tensorboard/main.py'
source ~/.bashrc
# The tensorboard command is now available; --bind_all exposes the server to the network
tensorboard --logdir='path_to_data' --port 6006 --bind_all
# If everything works properly, the terminal shows:
TensorBoard 2.10.1 at http://bff419356d91:6006/ (Press CTRL+C to quit)
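
Another way around the missing tensorboard command is to start the module with the Python interpreter that has the package installed, which avoids hard-coding the path to main.py. The sketch below only builds that command. It assumes the tensorboard.main module (the main.py found above) can be run with python -m; tensorboard_command is a hypothetical helper.

```python
import shlex
import sys


def tensorboard_command(logdir, port=6006, bind_all=True):
    """Build a command that runs Tensorboard through the current interpreter."""
    cmd = [sys.executable, "-m", "tensorboard.main",
           "--logdir", logdir, "--port", str(port)]
    if bind_all:
        cmd.append("--bind_all")  # expose the server to the network
    return cmd


cmd = tensorboard_command("/tmp")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run, or the printed line can be pasted into a shell.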

How to locally view tensorboard of remote server

# 1 On the remote server, start Tensorboard
tensorboard --logdir=/tmp/runs --port 6006 --bind_all
# 2 On the local PC, open a new terminal tab and create a tunnel:
ssh -NfL 6006:localhost:6006 $USER@remote-server-ip [-p port]
# e.g. with an SSH host alias named 3090 in place of USER@remote-server-ip [-p port]:
ssh -NfL 6006:localhost:6006 3090
# 3 Close Tensorboard: find the tensorboard process and terminate it
ps -ef | grep tensorboard  # get the details of the running tensorboard process
kill -9 <pid>   # kill the process using its PID
# 4 To display different runs in Tensorboard, put the logs of each experiment in its own subfolder; Tensorboard shows each subfolder as a separate run (Python):
writer = SummaryWriter(tbroot_dir + "/" + model_name + exp_tag)

Finally, open http://localhost:6006/ in the browser on your local PC (in general http://localhost:$LOCAL_PORT, where $LOCAL_PORT is the first port number given to ssh -L); all the aforementioned plots will be shown there.
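
The per-experiment log directory passed to SummaryWriter can be built with a small helper. This is a minimal sketch: run_log_dir and the example names resnet18 and _lr0.01 are hypothetical, and the SummaryWriter calls are shown as comments because they require PyTorch to be installed.

```python
import os


def run_log_dir(tbroot_dir, model_name, exp_tag):
    """Return <tbroot_dir>/<model_name><exp_tag>, one subfolder per run."""
    return os.path.join(tbroot_dir, model_name + exp_tag)


log_dir = run_log_dir("/tmp/runs", "resnet18", "_lr0.01")  # hypothetical names
print(log_dir)

# With PyTorch installed, the writer would then be created as:
#   from torch.utils.tensorboard import SummaryWriter
#   writer = SummaryWriter(log_dir)
#   writer.add_scalar("loss/train", 0.5, global_step=1)
#   writer.close()
```

Starting Tensorboard with --logdir=/tmp/runs then lists every such subfolder as its own run.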

SSH & Screen in Linux

Fixing broken pipe error with SSH

ssh_config(5)

# Edit /etc/ssh/sshd_config and add the line "PermitRootLogin yes" so that SSH allows root login.
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
# ERR: client_loop: send disconnect: Broken pipe
# Server side: update the sshd configuration (and restart sshd)
echo "ClientAliveInterval 160" | sudo tee -a /etc/ssh/sshd_config
echo "ClientAliveCountMax 6" | sudo tee -a /etc/ssh/sshd_config
# [OPTIONAL, the default is "yes"] echo "TCPKeepAlive yes" | sudo tee -a /etc/ssh/sshd_config
# Restart the SSH server for the new configuration to take effect:
sudo /etc/init.d/ssh force-reload
sudo /etc/init.d/ssh restart   # OR: sudo service ssh restart
# OR client side:
echo "ServerAliveInterval 60" >> ~/.ssh/config
echo "ServerAliveCountMax 20" | tee -a ~/.ssh/config
# [OPTIONAL, the default is "yes"] echo "TCPKeepAlive yes" | tee -a ~/.ssh/config
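
The client-side settings can also be scoped to a single host with a Host block in ~/.ssh/config, which at the same time gives the server a short alias. The sketch below generates such a block and shows roughly how long an unresponsive server is tolerated (interval times count). The alias gpu-box and both function names are hypothetical; the host, port, and user reuse the example values from this post.

```python
def ssh_host_block(alias, hostname, user, port=22,
                   alive_interval=60, alive_count=20):
    """Return a ~/.ssh/config Host block with keep-alive settings."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    Port {port}",
        f"    ServerAliveInterval {alive_interval}",
        f"    ServerAliveCountMax {alive_count}",
    ]
    return "\n".join(lines) + "\n"


def seconds_until_disconnect(interval, count_max):
    """Approximate time an unresponsive peer is tolerated before disconnecting."""
    return interval * count_max


block = ssh_host_block("gpu-box", "1.1.1.1", "root", port=8022)
print(block, end="")
print(seconds_until_disconnect(60, 20), "seconds")
```

With this block in place, ssh gpu-box connects to the container without repeating the port and user on the command line.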

screen

With screen, you can start a window session, detach it so it’s still running in the background, log off or back in, and reattach the session.

Ctrl+a d  # detach from the current screen session
screen -ls  # list the running screen sessions
screen -r 10835 # reattach the screen session with ID 10835
screen -S monitor # start a screen session called "monitor"
screen -wipe # remove dead screen sessions

Sharing a screen Session
You can also use a screen session to allow two people to see and interact with the same window. Let’s say someone running Fedora on his computer wants to connect to our Ubuntu server.

# One person starts a screen session called "ssh-geek" using the -S (session name) option, together with the -d (detach) and -m (enforced creation) options so that the new session starts already detached.
screen -d -m -S ssh-geek
# The same person then uses the -x (multi-display mode) option to attach to the session:
screen -x ssh-geek
# The other person also uses the -x (multi-display mode) option to join the same session:
screen -x ssh-geek
Now, anything either person types, the other will see. Both people are now sharing a screen session that’s running on a remote Ubuntu computer.

Troubleshooting

  • Run Python GUI code on the remote server from PyCharm on the local PC
    [Error] X11 forwarding is not enabled or not supported by the server; qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
    [Solved]
sudo apt install qttools5-dev-tools qttools5-dev libqt5designer5 python3-pyqt5 
sudo apt install xserver-xephyr xorg openbox eric 
sudo apt install libxcb-xinerama0 libxcb-image0 libxcb-keysyms1 libxcb-render-util0 libxcb-xkb1 libxkbcommon-x11-0
# [Optional] sudo apt-get install ubuntu-desktop
echo "X11Forwarding yes" | sudo tee -a /etc/ssh/sshd_config
echo "AddressFamily inet" | sudo tee -a /etc/ssh/sshd_config
sudo service ssh restart  # apply the modified sshd_config

ssh -X username@serverIP -p port
env # check the value of the environment variable DISPLAY, e.g., DISPLAY=localhost:10.0

In PyCharm, select the Python file -> right click -> More Run/Debug -> Modify Run Configuration, and add PYTHONUNBUFFERED=1;DISPLAY=localhost:10.0 (adjust DISPLAY to the value obtained with env) to Environment variables.
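
The DISPLAY value has the form host:display.screen, and an X server reachable over TCP conventionally listens on port 6000 plus the display number, so localhost:10.0 corresponds to local port 6010. The following sketch splits a DISPLAY value accordingly; parse_display and x11_tcp_port are hypothetical helper names.

```python
def parse_display(value):
    """Split an X11 DISPLAY value such as 'localhost:10.0' into its parts."""
    if ":" not in value:
        raise ValueError(f"not a DISPLAY value: {value!r}")
    host, _, rest = value.rpartition(":")
    display, _, screen = rest.partition(".")
    return host, int(display), (int(screen) if screen else 0)


def x11_tcp_port(display_number):
    """X11 over TCP conventionally uses port 6000 + display number."""
    return 6000 + display_number


host, display, screen = parse_display("localhost:10.0")
print(host, display, screen, x11_tcp_port(display))
```

Inside a running Python session the current value is available as os.environ.get("DISPLAY"); if it is missing, X11 forwarding was not set up for that session.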

  • Run python GUI code in the remote server on localhost VS Code

  • opam init in docker
    [Error] bwrap: Creating new namespace failed: Operation not permitted
    [Solution] Either disable opam sandboxing in the Docker image (opam init --disable-sandboxing), or run the Docker container with docker run --privileged so that the container can create its own namespaces.

  • warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
    [Solved]

sudo apt install locales
sudo locale-gen en_US.UTF-8
sudo dpkg-reconfigure locales   # [OPTIONAL]

The last step opens a text-based UI. Select en_US.UTF-8 by moving with the up and down arrow keys and pressing the spacebar, or by typing its ID (159 for en_US.UTF-8 in this case).
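
Whether the locale is actually usable can be checked from Python with the standard locale module. This is a minimal sketch; locale_available is a hypothetical helper that restores the previous locale after the test.

```python
import locale


def locale_available(name):
    """Return True if the C library can switch to locale `name`."""
    previous = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_ALL, previous)  # restore the original


print("en_US.UTF-8 available:", locale_available("en_US.UTF-8"))
```

If this prints False inside the container, the locale has not been generated yet and the locale-gen step above still needs to be run.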
