Install GPU Operator

Precondition

$ kubectl create ns gpu-operator
$ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

  1. Add the NVIDIA Helm repository:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

  2. Install the Operator and specify configuration options:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set <option-name>=<option-value>

Pre-Installed NVIDIA Container Toolkit (but no drivers)

In this scenario, the NVIDIA Container Toolkit is already installed on the worker nodes that have GPUs.

  1. Configure the toolkit to use /run/nvidia/driver as the root directory of the driver installation, because this is the path that the driver container mounts.

$ sudo apt -y install nvidia-container-toolkit
# Configure the runtime the node actually uses; containerd matches the restart below.
$ sudo nvidia-ctk runtime configure --runtime=containerd
$ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
$ sudo systemctl restart containerd kubelet

  2. Install the Operator with the following options (which will provision a driver):

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set toolkit.enabled=false
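
The sed command in step 1 only uncomments the root setting. A minimal sketch of its effect, run against a scratch file rather than the real /etc/nvidia-container-runtime/config.toml; the sample file content below is an assumption for illustration, and a real config.toml contains more settings:

```shell
#!/usr/bin/env bash
# Illustrative only: apply the same sed expression to a scratch copy
# so the effect is visible without touching /etc.
demo_dir="$(mktemp -d)"
demo_cfg="$demo_dir/config.toml"

# Assumed sample content; the real file has more sections and keys.
cat > "$demo_cfg" <<'EOF'
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
EOF

# Same expression as in step 1: drop the leading '#' on lines starting with '#root'.
sed -i 's/^#root/root/' "$demo_cfg"

# Print the result; only the root line should now be active.
cat "$demo_cfg"
```

Other commented settings, such as the path line in this sample, stay commented because the expression is anchored to lines that begin with #root.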

If the Kubernetes node is not clean, go to step 2.

To disable operands from getting deployed on a GPU worker node, label the node with nvidia.com/gpu.deploy.operands=false.

$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=false --overwrite
# Continue...
$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=true --overwrite

Important: Before you set the label nvidia.com/gpu.deploy.operands=true on a node, install the NVIDIADriver custom resource.
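
The disable and re-enable sequence above can be wrapped in a small function so the label value is checked before kubectl runs. This is a sketch only; set_gpu_operands is a hypothetical helper name and is not part of the GPU Operator tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: toggle the nvidia.com/gpu.deploy.operands label on one node.
set_gpu_operands() {
  local node="$1" state="$2"

  # Reject a missing node name or any value other than true/false.
  if [ -z "$node" ]; then
    echo "usage: set_gpu_operands <node> true|false" >&2
    return 2
  fi
  case "$state" in
    true|false) ;;
    *) echo "usage: set_gpu_operands <node> true|false" >&2; return 2 ;;
  esac

  # Same command as in the text above, with the node and value quoted.
  kubectl label nodes "$node" "nvidia.com/gpu.deploy.operands=${state}" --overwrite
}
```

For example, run set_gpu_operands "$NODE" false before maintenance and set_gpu_operands "$NODE" true afterwards.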

One Driver Type and Version on All Nodes

  1. Optional: Remove previously applied node labels.
  2. Create a file, such as nvd-all.yaml, with contents like the following:

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: demo-all
spec:
  driverType: gpu
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  rdma:
    enabled: false
    useHostMofed: false
  gds:
    enabled: false
  repository: nvcr.io/nvidia
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 60
  usePrecompiled: false
  version: 535.104.12

Tip: Because the manifest does not include a nodeSelector field, the driver custom resource selects all nodes in the cluster that have an NVIDIA GPU.

  3. Apply the manifest:

$ kubectl apply -n gpu-operator -f nvd-all.yaml
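
After the manifest is applied, it can help to check what the cluster did with it. The following sketch assumes the file name nvd-all.yaml and the gpu-operator namespace used above; check_driver_rollout is a hypothetical helper name, and it uses only standard kubectl get invocations:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report on the driver rollout after the manifest is applied.
check_driver_rollout() {
  local manifest="${1:-nvd-all.yaml}" ns="${2:-gpu-operator}"

  # Show the object(s) defined in the manifest as the API server sees them.
  kubectl get -n "$ns" -f "$manifest"

  # List the pods in the Operator namespace with their nodes.
  kubectl get pods -n "$ns" -o wide

  # Print events with the most recent last, useful when a pod is stuck.
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'
}
```

Calling check_driver_rollout with no arguments uses the defaults; pass a different manifest path or namespace as the first and second argument if yours differ.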

Upgrading the NVIDIA GPU Operator

  1. Specify the Operator release tag in an environment variable:

$ export RELEASE_TAG=v23.9.2

  2. Fetch the values from the chart:

$ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml

  3. Update the values file as needed.
  4. Upgrade the Operator:

$ helm upgrade $(helm ls -n gpu-operator | awk '{print $1}' | tail -n +2) nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml

  5. Optional: Disable the driver auto-upgrade policy by setting the following value:

# driver.upgradePolicy.autoUpgrade: false

Example Output

Release "gpu-operator" has been upgraded. Happy Helming!
NAME: gpu-operator
LAST DEPLOYED: Thu Apr 20 15:05:52 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 2
TEST SUITE: None
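
The upgrade command looks up the release name with helm ls, awk, and tail, which silently passes several names or none if the namespace does not hold exactly one release. A hedged sketch of the same lookup with that case checked; gpu_operator_release is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the single Helm release name in the namespace.
gpu_operator_release() {
  local ns="${1:-gpu-operator}" names count

  # Skip the header row and keep the NAME column, like the awk/tail pipeline above.
  names="$(helm ls -n "$ns" | awk 'NR > 1 {print $1}')"

  # Count non-empty lines; '|| true' keeps grep's exit status from aborting under 'set -e'.
  count="$(printf '%s\n' "$names" | grep -c . || true)"

  if [ "$count" -ne 1 ]; then
    echo "expected exactly one release in namespace $ns, found $count" >&2
    return 1
  fi
  printf '%s\n' "$names"
}
```

With the function defined, the upgrade can be run as helm upgrade "$(gpu_operator_release)" nvidia/gpu-operator -n gpu-operator -f "values-$RELEASE_TAG.yaml".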

About Upgrading the GPU Driver

Reference: NVIDIA GPU Driver Custom Resource Definition (NVIDIA GPU Operator 24.3.0 documentation)

  1. Optional: If you want to run more than one driver type or version in the cluster, label the worker nodes to identify the driver type and version to install on each node. Example:

$ kubectl label node <node-name> --overwrite driver.version=525.125.06

  • To use a mix of driver types, such as vGPU, label nodes for the driver type.
  • To use a mix of driver versions, label the nodes for the different versions.
  • To use a mix of conventional drivers and precompiled driver containers, label the nodes for the different types.
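
Once nodes carry a driver.version label, a driver custom resource can target them through the nodeSelector field mentioned earlier. The sketch below writes one manifest per version; write_driver_manifest, the demo- name prefix, and the trimmed field list are assumptions for illustration, so start from the full manifest shown earlier for real use:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an NVIDIADriver manifest that selects nodes
# labeled driver.version=<version>.
write_driver_manifest() {
  local version="$1" out="$2"

  # Resource names cannot contain dots, so 525.125.06 becomes demo-525-125-06.
  cat > "$out" <<EOF
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: demo-${version//./-}
spec:
  driverType: gpu
  image: driver
  repository: nvcr.io/nvidia
  usePrecompiled: false
  version: "${version}"
  nodeSelector:
    driver.version: "${version}"
EOF
}
```

For example, write_driver_manifest 525.125.06 nvd-525.yaml produces a file that can be applied with kubectl apply -n gpu-operator -f nvd-525.yaml, matching the label set in the example above.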