1. Install CUDA and the driver on the GPU server
Run:
lspci | grep -i nvidia
to confirm that a GPU is present. If the lspci command is not found, install it first:
yum install pciutils -y
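The check in step 1 can be wrapped in a small helper for provisioning scripts. This is only a sketch: the function name count_nvidia_devices is hypothetical, and it reads lspci-style text from standard input so that it can be tried without real hardware.

```shell
# Hypothetical helper (not part of any NVIDIA tooling): counts the lines of
# lspci-style output on stdin that mention NVIDIA, case-insensitively.
count_nvidia_devices() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function usable under "set -e".
    grep -ci 'nvidia' || true
}

# Typical use on the server:
#   lspci | count_nvidia_devices
```

A result of 0 means no NVIDIA device is visible on the PCI bus, so the remaining steps would not apply to that machine.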
2. Install the NVIDIA and EPEL rpm repositories
Run (the cuda-repo rpm must already be downloaded into the current directory):
wget https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-12.noarch.rpm
rpm -ivh cuda-repo-rhel7-10.1.243-1.x86_64.rpm
rpm -ivh epel-release-7-12.noarch.rpm
yum clean all
3. Install CUDA and the driver. Many packages are pulled in, so this takes a long time.
yum -y install nvidia-driver-latest-dkms cuda cuda-drivers
If the installation fails because these packages are missing:
libvdpau(x86-64), vulkan-filesystem
then download and install them manually:
wget http://mirror.centos.org/centos/7/os/x86_64/Packages/libvdpau-1.1.1-3.el7.x86_64.rpm
wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
yum install vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm libvdpau-1.1.1-3.el7.x86_64.rpm
4. Install nvidia-docker
Run:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-container-toolkit nvidia-container-runtime
sudo systemctl restart docker
Run the following command to test that the installation is correct:
docker run --gpus '"device=0"' nvidia/cuda:10.0-base nvidia-smi
Note: starting with Docker 19.03, Docker supports NVIDIA GPUs natively through the --gpus option. Only the nvidia-container-toolkit package needs to be installed, without the more involved configuration that nvidia-docker2 required. This native mode is not supported by docker-compose.
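The repository URL in step 4 depends on the value of $distribution, which is built from the ID and VERSION_ID fields of /etc/os-release. The sketch below isolates that logic so it can be checked before fetching the repo file. The name distribution_from is hypothetical, and the file path is passed as an argument so the function can also be tried against a sample file.

```shell
# Hypothetical helper: prints ID followed by VERSION_ID from an os-release
# style file, the same string that step 4 stores in $distribution.
distribution_from() {
    # Source the file in a subshell so its variables do not leak into the
    # calling shell. Pass a path containing a slash, because "." searches
    # PATH for bare file names.
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Typical use:
#   distribution_from /etc/os-release
```

On a CentOS 7 host this should print centos7, which is the directory name the repository URL expects.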
5. Install NVIDIA's k8s-device-plugin
First, set /etc/docker/daemon.json to the following:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then restart the docker service.
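One way to create the file is with a here-document. This is a sketch that assumes no other Docker daemon settings need to be preserved, because the function overwrites its target. The name write_nvidia_daemon_json is hypothetical, and the target path is passed as an argument.

```shell
# Hypothetical helper: writes the daemon.json shown in step 5 to the path
# given as $1, replacing any existing file at that path.
write_nvidia_daemon_json() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Typical use (as root), followed by the restart from step 5:
#   write_nvidia_daemon_json /etc/docker/daemon.json
#   systemctl restart docker
```

After the restart, docker info --format '{{.DefaultRuntime}}' should report nvidia if the setting took effect.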
Run the following command to start the plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml
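Once the plugin is running, the GPU node should advertise an nvidia.com/gpu resource. The sketch below extracts that value from kubectl describe node output read on stdin. The name gpu_capacity is hypothetical, and the assumed layout (an "nvidia.com/gpu:" key followed by a count) is based on the usual Capacity section of that command.

```shell
# Hypothetical helper: prints the count from the first "nvidia.com/gpu:" line
# on stdin, or 0 if no such line exists.
gpu_capacity() {
    awk '$1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
         END { if (!found) print 0 }'
}

# Typical use (<node-name> is a placeholder for the GPU node's name):
#   kubectl describe node <node-name> | gpu_capacity
```

A value of 0 suggests the plugin pod is not running on that node or that Docker is not using the nvidia runtime by default.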
Run a test pod. Write a test manifest named gputest.yml:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.0-base
      resources:
        limits:
          nvidia.com/gpu: 1
Then run:
kubectl create -f gputest.yml
If the pod is scheduled onto the GPU server and shows its result, the setup succeeded.