Adding a GPU Node to an Existing K8S Cluster

Background

Our production environment runs a Kubernetes cluster built with RKE. A new workload needs GPUs, so a GPU node has to be added to the cluster.

Existing environment

RKE: Running RKE version: v1.1.2

Kubernetes: 1.17

  -  Master nodes: 3

  -  Worker nodes: 6 (all CPU-only)

New node information

OS: CentOS 7.6

GPU cards: 1 (driver version 440.95.01 already installed)

IP: 10.5.0.112

Steps

1. Prepare the new node

   a. Install Docker

# Install Docker 19.03 via Rancher's install script
curl https://releases.rancher.com/install-docker/19.03.sh | sh

  b. Install nvidia-docker2

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-docker2

 c. Configure Docker's default runtime

vi /etc/docker/daemon.json

  File contents

{
  "registry-mirrors": [
    "https://dockerhub.azk8s.cn",
    "https://docker.mirrors.ustc.edu.cn",
    "http://hub-mirror.c.163.com"
  ],
  "max-concurrent-downloads": 10,
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
    },
  "data-root": "/data/docker",
  "group": "docker",
  "default-runtime": "nvidia",
  "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
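After saving daemon.json, it is worth validating the JSON and restarting Docker before moving on, because a malformed file prevents the Docker daemon from starting at all. A small sketch (the /tmp fallback copy is only there so the snippet runs anywhere; on the node, point it at /etc/docker/daemon.json directly, and note CentOS 7 may need python3 installed first):

```shell
# Validate daemon.json syntax; a JSON error here stops dockerd from starting.
# Fall back to a minimal sample so the snippet is runnable outside the node.
cp /etc/docker/daemon.json /tmp/daemon.json 2>/dev/null || \
  printf '{"default-runtime": "nvidia"}\n' > /tmp/daemon.json
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json OK"

# On the node itself, then restart Docker and confirm the default runtime:
#   systemctl restart docker
#   docker info --format '{{.DefaultRuntime}}'   # should print: nvidia
```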

  d. Create the account RKE connects as, and set up passwordless SSH from the RKE control node to it (my RKE control node is 10.4.0.57, so running ssh-copy-id docker@10.5.0.112 there completes the passwordless setup)

useradd docker -g docker

2. Add the new node to the RKE cluster configuration

rke --debug up --update-only --config rancher_v2.yaml

 RKE configuration file contents

ssh_key_path: ~/.ssh/id_rsa
nodes:
  - address: 10.159.1.247 
    internal_address: 10.4.0.37
    user: docker 
    role: [controlplane, etcd]
  - address: 10.159.1.67
    internal_address: 10.4.0.24
    user: docker
    role: [controlplane, etcd]
  - address: 10.159.1.242
    internal_address: 10.4.0.38
    user: docker
    role: [controlplane, etcd]
  - address: 10.4.0.63
    internal_address: 10.4.0.63
    user: docker
    role: [worker]
  - address: 10.4.0.18
    internal_address: 10.4.0.18
    user: docker
    role: [worker]
  - address: 10.4.0.43
    internal_address: 10.4.0.43
    user: docker
    role: [worker]
  - address: 10.4.0.80
    internal_address: 10.4.0.80
    user: docker
    role: [worker]
  - address: 10.4.0.26
    internal_address: 10.4.0.26
    user: docker
    role: [worker]
  - address: 10.4.0.111
    internal_address: 10.4.0.111
    user: docker
    role: [worker]
  - address: 10.5.0.112  # new node
    internal_address: 10.5.0.112
    user: docker
    role: [worker]
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24
    backup_config:
      enabled: true
      interval_hours: 12
      retention: 6
  kube-api:
    service_node_port_range: 20000-60000
#  kubelet:
#    extra_binds:
#       - "/data:/data:rshared"

# Disable RKE's default nginx-ingress; I prefer traefik-ingress
ingress:
  provider: none
#  options:
#    use-forwarded-headers: 'true'

network:
  mtu: 1450
  plugin: canal
  options:
    flannel_backend_type: "vxlan"

# This is an SLB (load balancer) address used for API Server high availability
authentication:
  sans:
    - "10.159.1.163"
kubernetes_version: "v1.17.6-rancher2-1"

3. Update the RKE cluster and wait for the new node to reach Ready status
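The Ready check can be done from the control node with kubectl; a sketch:

```shell
# After `rke up` finishes, watch for the new worker to register and go Ready.
# With the canal CNI this typically takes a minute or two.
kubectl get nodes -o wide
# The new node should be listed with INTERNAL-IP 10.5.0.112 and STATUS Ready;
# if it stays NotReady, check kubelet and the CNI pod logs on the node.
```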

4. Install nvidia-gpu-plugin. This service reports each GPU node's GPUs to the Kubernetes cluster so the scheduler can allocate them.

kubectl apply -f nvidia-gpu-plugin.yaml

nvidia-gpu-plugin.yaml contents

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      nodeSelector:
        nvidia.com/gpu.present: 'true'
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
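One thing to watch: the DaemonSet above only schedules onto nodes carrying the label nvidia.com/gpu.present=true (see its nodeSelector). If nothing in your cluster applies that label automatically (for example NVIDIA's gpu-feature-discovery), label the new node by hand; a sketch, assuming the node registered under its IP address:

```shell
# Without this label the device-plugin pod will never be scheduled onto
# the GPU node, and nvidia.com/gpu will never be advertised to the cluster.
kubectl label node 10.5.0.112 nvidia.com/gpu.present=true
```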

5. Once the nvidia-gpu-plugin pod has started, check its logs to confirm the GPU was detected correctly. Normal output looks like this:

[root@rke-controller ]# kubectl get po -n kube-system |grep nvidia
nvidia-device-plugin-daemonset-7hrk5      1/1     Running     0          28m
[root@rke-controller ]# kubectl logs nvidia-device-plugin-daemonset-7hrk5 -n kube-system
2021/05/19 08:15:49 Loading NVML
2021/05/19 08:15:49 Starting FS watcher.
2021/05/19 08:15:49 Starting OS watcher.
2021/05/19 08:15:49 Retreiving plugins.
2021/05/19 08:15:49 Starting GRPC server for 'nvidia.com/gpu'
2021/05/19 08:15:49 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/05/19 08:15:49 Registered device plugin for 'nvidia.com/gpu' with Kubelet
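At this point the GPU should also show up as an allocatable resource from the scheduler's point of view; a sketch (the node name is an assumption here, use whatever `kubectl get nodes` shows):

```shell
# The Allocatable section of the node should now include nvidia.com/gpu: 1.
kubectl describe node 10.5.0.112 | grep -A 6 'Allocatable'

# Or list the GPU count for every node at once; the dots in the resource
# name must be escaped in the JSONPath expression.
kubectl get nodes -o "custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```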

6. Create a pod to verify that a GPU container runs in Kubernetes

kubectl apply -f gpu_test.yaml

 gpu_test.yaml contents

apiVersion: v1
kind: Pod
metadata:
   name: dcgmproftester
spec:
   restartPolicy: OnFailure
   containers:
   - name: dcgmproftester11
     image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
     args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
     resources:
       limits:
          nvidia.com/gpu: 1
     securityContext:
       capabilities:
         add: ["SYS_ADMIN"]

7. The pod runs successfully and prints the expected output

[root@rke-controller]# kubectl get po
NAME                              READY   STATUS      RESTARTS   AGE
dcgmproftester                    1/1     Running     0          73s
details-v1-5974b67c8-rgbgb        2/2     Running     0          132d
[root@rke-controller]# kubectl logs dcgmproftester -f
Skipping CreateDcgmGroups() since DCGM validation is disabled
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 1024
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 40
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 65536
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 5
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 256
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 5001000
Max Memory bandwidth: 320064000000 bytes (320.06 GiB)
CudaInit completed successfully.

Skipping WatchFields() since DCGM validation is disabled
TensorEngineActive: generated ???, dcgm 0.000 (27804.3 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28499.7 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28529.5 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28576.3 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28385.3 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28379.4 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28755.1 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (29019.5 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28880.3 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28932.4 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28704.2 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (28844.0 gflops)
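While the test pod is running, you can also watch utilization on the GPU node itself; a sketch:

```shell
# dcgmproftester's Tensor Core workload ("-t 1004") should drive GPU
# utilization to near 100% for the 120 seconds requested by "-d 120".
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 2
```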
