Scheduling GPU resources in Kubernetes

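Kubernetes includes experimental support for scheduling AMD and NVIDIA GPUs across nodes. Support for NVIDIA GPUs was added in v1.6 and has since gone through several backwards-incompatible iterations. Support for AMD GPUs was added in v1.9 and is provided through a device plugin.

This document describes how to use GPUs in different Kubernetes versions, along with the current limitations.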

v1.8 onwards

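From 1.8 onwards, the recommended way to consume GPUs is through device plugins.

To enable GPU support before 1.10, the DevicePlugins feature gate has to be explicitly enabled across the system: --feature-gates="DevicePlugins=true". From 1.10 onwards this setting is no longer required.

You also have to install GPU drivers on each node. Both the drivers and the device plugin are provided by the corresponding GPU vendor (AMD, NVIDIA).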

When the above conditions are met, Kubernetes exposes nvidia.com/gpu or amd.com/gpu as a schedulable resource.

You can consume these GPUs from your containers by requesting <vendor>.com/gpu just like you request cpu or memory. However, there are some limitations in how you specify the resource requirements when using GPUs:

  • GPUs are only supposed to be specified in the limits section, which means:
    • You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
    • You can specify GPU in both limits and requests but these two values must be equal.
    • You cannot specify GPU requests without specifying limits.
  • Containers (and pods) do not share GPUs. There’s no overcommitting of GPUs.
  • Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.

Here’s an example:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
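
As a further sketch of the rules listed above, the pod below sets GPU requests and limits explicitly, in which case the two values have to match. The pod name cuda-vector-add-2gpu and the count of 2 are illustrative assumptions rather than part of the original example, and a node needs at least two unallocated GPUs for this pod to be scheduled.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-2gpu   # hypothetical name, chosen for illustration
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        requests:
          nvidia.com/gpu: 2    # optional; if present it must equal the limit
        limits:
          nvidia.com/gpu: 2    # whole GPUs only, no fractional values
```

Leaving out the requests block gives the same result, because the limit is used as the request value by default.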

Deploying AMD GPU device plugin

The official AMD GPU device plugin has the following requirements:

  • Kubernetes nodes have to be pre-installed with AMD GPU Linux driver.

To deploy the AMD device plugin once your cluster is running and the above requirements are satisfied:

# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.9/k8s-ds-amdgpu-dp.yaml

# For Kubernetes v1.10
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml

Report issues with this device plugin to RadeonOpenCompute/k8s-device-plugin.
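
Once the AMD device plugin is running, a pod can ask for an AMD GPU in the same way as the NVIDIA example earlier, using the amd.com/gpu resource name. The following is a minimal sketch only: the pod name and the image placeholder are assumptions that do not come from the original text, so substitute an image that contains your own GPU workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-example            # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: amd-gpu-example
      image: "<your-amd-gpu-image>"  # placeholder, replace with a real image
      resources:
        limits:
          amd.com/gpu: 1           # requesting 1 AMD GPU
```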

Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:

Official NVIDIA GPU device plugin

The official NVIDIA GPU device plugin has the following requirements:

  • Kubernetes nodes have to be pre-installed with NVIDIA drivers.
  • Kubernetes nodes have to be pre-installed with nvidia-docker 2.0
  • nvidia-container-runtime must be configured as the default runtime for docker instead of runc.
  • NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and the above requirements are satisfied:

# For Kubernetes v1.8
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml

# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml

Report issues with this device plugin to NVIDIA/k8s-device-plugin.

NVIDIA GPU device plugin used by GCE

The NVIDIA GPU device plugin used by GCE doesn’t require using nvidia-docker and should work with any container runtime that is compatible with the Kubernetes Container Runtime Interface (CRI). It’s tested on Container-Optimized OS and has experimental code for Ubuntu from 1.9 onwards.

On your 1.12 cluster, you can use the following commands to install the NVIDIA drivers and device plugin:

# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml

# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml

# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.12/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

Report issues with this device plugin and installation method to GoogleCloudPlatform/container-engine-accelerators.

Instructions for using NVIDIA GPUs on GKE are available in the Google Kubernetes Engine documentation.

Clusters containing different types of NVIDIA GPUs

If different nodes in your cluster have different types of NVIDIA GPUs, then you can use Node Labels and Node Selectors to schedule pods to appropriate nodes.

For example:

# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100

Specify the GPU type in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.

This will ensure that the pod will be scheduled to a node that has the GPU type you specified.
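
If a workload can run on more than one GPU type, a node selector is too strict, because it matches exactly one label value. One possible alternative, not covered in the original text, is a node affinity rule with the In operator. The sketch below assumes the same accelerator labels as above; the pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-any-gpu    # hypothetical name
spec:
  restartPolicy: OnFailure
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator   # label applied with kubectl label above
                operator: In
                values:            # any one of these values satisfies the rule
                  - nvidia-tesla-k80
                  - nvidia-tesla-p100
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
```

With this rule the scheduler may place the pod on any node labeled with either accelerator type, provided that node has a free GPU.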

Reposted from: https://my.oschina.net/u/2306127/blog/2996964
