Precondition
$ kubectl create ns gpu-operator
$ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
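Optionally confirm that the namespace now carries the privileged Pod Security Admission label (a quick sanity check, not required by the installation):
$ kubectl get ns gpu-operator --show-labels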
- Add the NVIDIA Helm repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
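If you want to see which chart versions the repository offers before installing, helm can list them (the head is just to trim the output):
$ helm search repo nvidia/gpu-operator -l | head -n 5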
- Install the Operator and specify configuration options:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set <option-name>=<option-value>
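After the install returns, a rough way to confirm that the Operator and its operands are starting is to watch the pods in the namespace (pod names vary by release):
$ kubectl get pods -n gpu-operator --watch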
Pre-Installed NVIDIA Container Toolkit (but no drivers)
In this scenario, the NVIDIA Container Toolkit is already installed on the worker nodes that have GPUs.
- Configure the toolkit to use the root directory of the driver installation as /run/nvidia/driver, because this is the path mounted by the driver container. A quick check of the resulting configuration follows the commands below.
$ sudo apt -y install nvidia-container-toolkit
$ nvidia-ctk runtime configure --runtime=crio
$ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
$ sudo systemctl restart containerd kubelet
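Assuming the toolkit's default config.toml ships the root entry commented out (which the sed command above relies on), the result can be spot-checked roughly like this:
$ grep '^root' /etc/nvidia-container-runtime/config.toml
root = "/run/nvidia/driver"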
- Install the Operator with the following options (which will provision a driver):
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set toolkit.enabled=false
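With toolkit.enabled=false, the Operator should not roll out its own container toolkit daemonset; a rough way to confirm which operands were deployed is to list the daemonsets (exact names vary by release):
$ kubectl get daemonsets -n gpu-operator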
If the Kubernetes node is not clean, go to Step 2.
To disable operands from getting deployed on a GPU worker node, label the node with nvidia.com/gpu.deploy.operands=false.
$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=false --overwrite
# Continue...
$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=true --overwrite
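A quick way to check the current value of the label on a node (the grep pattern is only illustrative):
$ kubectl get node $NODE --show-labels | tr ',' '\n' | grep gpu.deploy.operands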
Important! Install the NVIDIADriver CRD before re-labeling nodes with "nvidia.com/gpu.deploy.operands=true".
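One way to verify that the CRD is present before re-labeling (assuming it is named nvidiadrivers.nvidia.com, as in recent Operator releases):
$ kubectl get crd nvidiadrivers.nvidia.com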
One Driver Type and Version on All Nodes
- Optional: Remove previously applied node labels.
- Create a file, such as nvd-all.yaml, with contents like the following:
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: demo-all
spec:
  driverType: gpu
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  rdma:
    enabled: false
    useHostMofed: false
  gds:
    enabled: false
  repository: nvcr.io/nvidia
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 60
  usePrecompiled: false
  version: 535.104.12
Tip: Because the manifest does not include a nodeSelector field, the driver custom resource selects all nodes in the cluster that have an NVIDIA GPU.
- Apply the manifest:
$ kubectl apply -n gpu-operator -f nvd-all.yaml
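After applying, a rough check is to list the NVIDIADriver resources and watch the driver pods come up (pod names vary by release and node):
$ kubectl get nvidiadrivers
$ kubectl get pods -n gpu-operator | grep nvidia-driver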
Upgrading the NVIDIA GPU Operator
- Specify the Operator release tag in an environment variable:
$ export RELEASE_TAG=v23.9.2
- Fetch the values from the chart:
$ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
- Update the values file as needed.
- Upgrade the Operator:
$ helm upgrade $(helm ls -n gpu-operator | awk '{print $1}' | tail -n +2) nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml
- Optional: Disable the automatic driver upgrade policy by setting the following in the values file:
# driver.upgradePolicy.autoUpgrade: false
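In the fetched values file, that dotted key corresponds to nested YAML, roughly:
driver:
  upgradePolicy:
    autoUpgrade: false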
Example Output
Release "gpu-operator" has been upgraded. Happy Helming!
NAME: gpu-operator
LAST DEPLOYED: Thu Apr 20 15:05:52 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 2
TEST SUITE: None
About Upgrading the GPU Driver
Reference: NVIDIA GPU Driver Custom Resource Definition (NVIDIA GPU Operator 24.3.0 documentation)
- Optional: If you want to run more than one driver type or version in the cluster, label the worker nodes to identify the driver type and version to install on each node. Example:
$ kubectl label node <node-name> --overwrite driver.version=525.125.06
- To use a mix of driver types, such as vGPU, label nodes for the driver type.
- To use a mix of driver versions, label the nodes for the different versions.
- To use a mix of conventional drivers and precompiled driver containers, label the nodes for the different types.
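For example, a second NVIDIADriver resource could then target only the labeled nodes through a nodeSelector. A minimal sketch under these assumptions (the resource name demo-525 and the driver.version label value mirror the command above and are illustrative):
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: demo-525
spec:
  driverType: gpu
  repository: nvcr.io/nvidia
  image: driver
  version: 525.125.06
  nodeSelector:
    driver.version: "525.125.06"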