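Without extra configuration, running nvidia-smi inside a container usually fails with a command-not-found error. The cause is missing configuration on the host, not a missing driver in the container: the NVIDIA driver does not need to be installed a second time inside the container. Starting from the NVIDIA driver setup, this article walks through everything needed to give a container access to the GPU.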
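First, on the host, check or set up the NVIDIA driver environment by following section 7.1 of the author's earlier post:
clGetPlatformIDs error -1001 and installing OpenCL and CUDA (<SLF>'s blog, CSDN)
Alternatively, use the following commands: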
$ sudo apt update
$ sudo apt install -y ubuntu-drivers-common
$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo ubuntu-drivers autoinstall
$ NvidiaVersion=`ubuntu-drivers devices | grep recommended | awk -F' ' '{print $3}'`
$ sudo apt install -y $NvidiaVersion
$ sudo apt update
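# A reboot is required after the installation
# If the screen is garbled or black after the reboot, see section 7.1 of the post above for fixes (better to guard against this in advance)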
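After the reboot, it can help to confirm that the driver is actually usable on the host before touching any container configuration. The following is a minimal sketch; the function name driver_status and its three status words are made up for illustration, and it relies only on the nvidia-smi binary that ships with the driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): classify the host driver state.
driver_status() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "missing"       # nvidia-smi is not on PATH: the driver is not installed
    elif nvidia-smi >/dev/null 2>&1; then
        echo "ok"            # nvidia-smi ran successfully: driver loaded, GPU reachable
    else
        echo "not-loaded"    # binary exists but fails, e.g. no reboot yet or module not loaded
    fi
}

driver_status
```

Only when this prints ok is it worth continuing with the container setup.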
Once the NVIDIA driver is installed, run the following commands in a terminal to check whether nvidia-container-toolkit is installed:
$ which nvidia-container-toolkit
/usr/bin/nvidia-container-toolkit # this output means it is installed
$ dpkg -s nvidia-container-toolkit
Package: nvidia-container-toolkit
Status: install ok installed
Priority: optional
Section: utils
Installed-Size: 4344
Maintainer: NVIDIA CORPORATION <cudatools@nvidia.com>
Architecture: amd64
Version: 1.9.0-1
Replaces: nvidia-container-runtime (<= 3.5.0-1), nvidia-container-runtime-hook
Depends: libnvidia-container-tools (>= 1.9.0-1), libnvidia-container-tools (<< 2.0.0), libseccomp2
Breaks: nvidia-container-runtime (<= 3.5.0-1), nvidia-container-runtime-hook
Conffiles:
/etc/nvidia-container-runtime/config.toml 9b4a0ffec803274ee7ced18057933bfb
Description: NVIDIA container runtime hook
Provides a OCI hook to enable GPU support in containers.
Homepage: https://github.com/NVIDIA/nvidia-container-runtime/wiki
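For use in a script, the same check can be reduced to a yes/no answer. This is a sketch, assuming a Debian-based host with dpkg-query available; the function name toolkit_installed is hypothetical. It reads the db:Status-Status field, which is "installed" only for a fully installed package.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if nvidia-container-toolkit is fully installed.
toolkit_installed() {
    # ${db:Status-Status} is the status word from the dpkg database
    # (installed, config-files, not-installed, ...). Errors for unknown
    # packages or a missing dpkg-query are silenced and count as "not installed".
    [ "$(dpkg-query -W -f='${db:Status-Status}' nvidia-container-toolkit 2>/dev/null)" = "installed" ]
}

if toolkit_installed; then
    echo "nvidia-container-toolkit: installed"
else
    echo "nvidia-container-toolkit: not installed"
fi
```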
If it is not installed, install it with the following commands:
sudo apt update
sudo apt install -y curl # only needed if curl is not installed yet
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
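The repository URL above depends on the value of $distribution, so if the download of nvidia-docker.list fails it is worth checking what that variable expands to. The sketch below wraps the same expression in a function; the name distribution_id and the optional file argument are additions for illustration, not part of the original commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print $ID$VERSION_ID from an os-release file.
# With no argument it reads /etc/os-release, like the command above.
distribution_id() {
    # Source the file in a subshell so ID and VERSION_ID do not leak into the caller.
    ( . "${1:-/etc/os-release}" && echo "${ID}${VERSION_ID}" )
}
```

On an Ubuntu 20.04 host, for example, calling distribution_id prints ubuntu20.04, so the list file is fetched from the ubuntu20.04 directory of the repository.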
Check whether the file /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json exists on the host. If it does not, create it with the following commands:
Content=`cat << 'EOF'
{
"version": "1.0.0",
"hook": {
"path": "/usr/bin/nvidia-container-toolkit",
"args": ["nvidia-container-toolkit", "prestart"],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]
},
"when": {
"always": true,
"commands": [".*"]
},
"stages": ["prestart"]
}
EOF`
HookFile=/usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
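sudo mkdir -p "$(dirname "$HookFile")"
# In `sudo echo ... > file` the redirect is performed by the calling (non-root) shell and fails, so write through sudo tee instead
echo "$Content" | sudo tee "$HookFile" > /dev/null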
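A quick sanity check of the hook file can catch a typo before it turns into a hook error at container start. The following is only a sketch: the function name check_hook is hypothetical, and instead of a real JSON parser it uses sed to pull out the "path" value, which is enough for the flat file shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that a hook file exists and that the binary
# named in its "path" key is executable.
check_hook() {
    local hook_file=$1 hook_bin
    [ -f "$hook_file" ] || { echo "hook file missing: $hook_file"; return 1; }
    # Extract the value of the "path" key (first match only).
    hook_bin=$(sed -n 's/.*"path": *"\([^"]*\)".*/\1/p' "$hook_file" | head -n 1)
    [ -x "$hook_bin" ] || { echo "hook binary not executable: $hook_bin"; return 1; }
    echo "hook ok: $hook_bin"
}
```

Calling check_hook /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json on a correctly configured host should report /usr/bin/nvidia-container-toolkit as the hook binary.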
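The output of podman --help lists the --hooks-dir option, whose default value is /usr/share/containers/oci/hooks.d. This is why the file above is created in that directory.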
Change the configuration so that unprivileged (non-root) users can run and modify CUDA containers:
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml
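Note: this step is important. If it is skipped, starting or creating a container fails with:
error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1)
// When creating a container, you can additionally pass the option "--security-opt=label=disable"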
//ldconfig.real
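Note that sed exits successfully even when its pattern matches nothing, for example if the line in config.toml is worded differently in another toolkit version, so it is worth verifying that the edit took effect. The sketch below splits the step into two hypothetical functions, enable_no_cgroups and no_cgroups_enabled; it omits sudo so it can first be tried on a copy of the file.

```shell
#!/usr/bin/env bash
# Apply the same sed expression as above to the given file (GNU sed, in place).
enable_no_cgroups() {
    sed -i 's/^#no-cgroups = false/no-cgroups = true/;' "$1"
}

# Succeed only if an uncommented "no-cgroups = true" line is present.
no_cgroups_enabled() {
    grep -q '^no-cgroups = true' "$1"
}

# Example on a copy of the real file:
#   cp /etc/nvidia-container-runtime/config.toml /tmp/config.toml
#   enable_no_cgroups /tmp/config.toml
#   no_cgroups_enabled /tmp/config.toml && echo "edit applied"
```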
Finally, start the container with podman as usual. Running nvidia-smi inside the container is now recognized and prints its normal output.
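As a sketch of such a start-up, the function below combines the hooks directory and the --security-opt=label=disable option mentioned above into one smoke test. The names run and gpu_smoke_test and the DRY_RUN switch are hypothetical, and the image is passed as an argument because the original post does not name one.

```shell
#!/usr/bin/env bash
# Run a command, or just print it when DRY_RUN=1 (handy for reviewing the flags first).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Start a throwaway container from image $1 and run nvidia-smi in it.
gpu_smoke_test() {
    run podman run --rm \
        --security-opt=label=disable \
        --hooks-dir=/usr/share/containers/oci/hooks.d/ \
        "$1" nvidia-smi
}

# Example (the image tag is an assumption; use any CUDA image you have access to):
#   gpu_smoke_test docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04
```

If the hook and config.toml are set up as described, the container should print the same GPU table as nvidia-smi on the host.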
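If the container has no CUDA, download and install it the usual way (when creating a container for the first time, you can instead pull an OS image that bundles a specific CUDA version directly from NVIDIA's official repository).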