Goal: add a GPU node to a Kubernetes cluster and run containers that use the GPU
Environment: Kubernetes 1.11
Steps: Device Plugin overview -> GPU node driver and nvidia-docker installation -> docker runtime configuration -> Kubernetes component installation and joining the cluster -> kubelet configuration -> nvidia-device-plugin deployment -> test run
1. Device Plugin Overview
Kubernetes v1.8 introduced Alpha-version Device plugins to support devices such as GPUs, FPGAs, high-performance NICs, and InfiniBand. With this mechanism, a device vendor only needs to implement a plugin for its device against the Device Plugin interface, without modifying Kubernetes core code.
A Device plugin is in fact a gRPC service: it implements methods such as ListAndWatch() and Allocate(), and its gRPC server listens on a Unix socket under /var/lib/kubelet/device-plugins/.
2. GPU Node Driver and nvidia-docker Installation
The driver and nvidia-docker have already been installed on this node; the details will be filled in later. Verify with:
nvidia-smi
nvidia-docker version
3. docker Runtime Configuration
Change the docker runtime to nvidia. This requires installing nvidia-container-runtime first.
(1) Search for available nvidia-container-runtime versions
yum search --showduplicates nvidia-container-runtime
Since the docker version on the GPU node is 1.12.6, choose the matching build.
(2) Install
yum --setopt=obsoletes=0 install nvidia-container-runtime-1.1.0-1.docker1.12.6.x86_64
After installation, the runtime binary is placed under /usr/bin/.
(3) Modify the docker runtime
vim /lib/systemd/system/docker.service
Modify the following part:
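The exact edit was shown as a screenshot in the original post and is not reproduced here; a typical change for this setup (assuming the runtime binary landed at /usr/bin/nvidia-container-runtime, as described above) registers nvidia on the ExecStart line and makes it the default runtime:

```ini
# Excerpt from /lib/systemd/system/docker.service -- illustrative sketch, not the original screenshot
[Service]
ExecStart=/usr/bin/dockerd --add-runtime=nvidia=/usr/bin/nvidia-container-runtime --default-runtime=nvidia
```

With --default-runtime=nvidia, every container gets the nvidia runtime, so plain `docker run` (and kubelet-created containers) can see the GPUs without extra flags.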
Restart the service:
systemctl daemon-reload
systemctl restart docker.service
4. Kubernetes Component Installation and Joining the Cluster
Since the cluster was deployed with kubeadm, the new node needs the node components installed:
yum -y install kubeadm kubelet kubectl
Join the cluster:
Run the kubeadm join command; for details see the deployment post: https://blog.csdn.net/xingyuzhe/article/details/80507384
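The join command is cluster-specific; its general shape is as follows, where the address, token, and hash are placeholders, not real values (on the master, `kubeadm token create --print-join-command` prints the full command):

```shell
# Placeholders only -- substitute the values printed on the master
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
```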
5. kubelet Configuration
In version 1.11, the kubelet configuration lives in /etc/sysconfig/kubelet
Add the following:
KUBELET_EXTRA_ARGS=--fail-swap-on=false --cadvisor-port=4194 --feature-gates=DevicePlugins=true
Restart the service:
systemctl daemon-reload
systemctl restart kubelet
6. nvidia-device-plugin Deployment
Create the deployment file nvidia-device-plugin.yml:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        securityContext:
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
On the master node, run: kubectl create -f nvidia-device-plugin.yml
Check the deployment: kubectl get pods -n kube-system | grep nvidia
7. Test Run
Check the number of GPUs allocatable on the GPU node:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
Test pod: gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - image: nvidia/cuda
    name: cuda
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
kubectl create -f gpu-pod.yaml
kubectl logs gpu-pod