Deploying gpushare-scheduler-extender on Rancher: installing the GPU-sharing scheduler plugin gpushare-schd-extender

Article link: https://blog.csdn.net/vah101/article/details/108188420

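gpushare-scheduler-extender is a solution developed by Alibaba Cloud for sharing GPUs on Kubernetes. Pods request a slice of GPU memory through the aliyun.com/gpu-mem resource, so several pods can be scheduled onto the same card.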

First, follow the earlier post "Configuring GPU nodes in a k8s cluster" (vah101's column on CSDN) to install k8s-device-plugin, and set /etc/docker/daemon.json to:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
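As a quick sanity check of the file above, the following is a minimal sketch of a shell helper that tests whether a daemon.json names nvidia as the default runtime. The function name check_default_runtime is made up for this post, and the check is a plain text match, not a full JSON parse.

```shell
#!/bin/sh
# check_default_runtime is a hypothetical helper, not part of Docker.
# $1: path to a daemon.json file.
# Succeeds (exit 0) when the file contains "default-runtime": "nvidia".
check_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Example (run on the GPU node):
#   check_default_runtime /etc/docker/daemon.json && echo "nvidia is the default runtime"
```

After editing daemon.json, Docker normally has to be restarted before the change takes effect (for example with systemctl restart docker on systemd hosts).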

1. Modify the kube-scheduler configuration parameters of the existing cluster

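Rancher differs slightly from stock Kubernetes: its kube-scheduler runs as a Docker container rather than as a standalone executable. gpushare-scheduler-extender requires a change to the kube-scheduler configuration, so the RKE configuration has to be modified first, adding the new settings to the RKE cluster YAML.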

In Rancher's cluster list, pick a cluster and click "Upgrade" (升级) in the menu on its right.

Click the "Edit YAML" (编辑YAML) button and add the following under services:

    scheduler:
      extra_args:
        address: 0.0.0.0
        kubeconfig: /etc/kubernetes/ssl/kubecfg-kube-scheduler.yaml
        leader-elect: 'true'
        policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json
        profiling: 'false'
        v: '2'

Note: the existing arguments can be read on a master node with docker inspect kube-scheduler. Starting from those, just add policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json.
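Copying the existing flags by hand is error-prone, so here is a rough sketch that turns --key=value flags into key: 'value' lines of the kind used under extra_args. The helper name args_to_extra_args is invented for this post, and the sketch assumes every flag is written as --key=value with no single quotes in the value; lines that do not match (such as the bare kube-scheduler command name) are dropped.

```shell
#!/bin/sh
# args_to_extra_args is a hypothetical helper.
# Reads one flag per line on stdin and prints "key: 'value'" for each
# line of the form --key=value; all other lines are ignored.
args_to_extra_args() {
    sed -n "s/^--\([^=]*\)=\(.*\)\$/\1: '\2'/p"
}

# Example (on a master node; assumes the flags show up in the container's
# Args field, which may differ between RKE versions):
#   docker inspect --format '{{range .Args}}{{println .}}{{end}}' kube-scheduler \
#       | args_to_extra_args
```

The output still has to be indented to sit under extra_args, and policy-config-file has to be added by hand.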

2. Copy scheduler-policy-config.json into /etc/kubernetes/ssl/; this must be done on every master node

cd /etc/kubernetes/ssl/
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json

3. Start gpushare-schd-extender

curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
kubectl create -f gpushare-schd-extender.yaml

Note that with the default gpushare-schd-extender.yaml, gpushare-schd-extender is not started on the master node. If your GPU happens to sit on a master node, remove the NoSchedule-related settings for node-role.kubernetes.io/master:

      nodeSelector:
         node-role.kubernetes.io/master: ""

4. Deploy gpushare-device-plugin

Note: if nvidia-device-plugin was installed earlier, delete it first

kubectl delete ds -n kube-system nvidia-device-plugin-daemonset

Then:

wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
kubectl create -f device-plugin-ds.yaml

5. Label the GPU nodes

For GPU workloads to be scheduled onto servers that have GPUs, those nodes need the label gpushare=true

From the command line:

# kubectl label node <target_node> gpushare=true
# For example, if the GPU server's hostname is GPU_NODE:
kubectl label node GPU_NODE gpushare=true
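With more than one GPU server, the same command can be wrapped in a small loop. The sketch below uses a made-up helper, label_gpu_nodes, and a made-up DRY_RUN switch that prints the commands instead of running them; the node names in the example are placeholders.

```shell
#!/bin/sh
# label_gpu_nodes is a hypothetical helper.
# Arguments: one or more node names.
# With DRY_RUN=1 it only prints the kubectl commands; otherwise it runs them.
label_gpu_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "kubectl label node $node gpushare=true"
        else
            kubectl label node "$node" gpushare=true
        fi
    done
}

# Example:
#   DRY_RUN=1
#   label_gpu_nodes gpu-node-1 gpu-node-2
```

Afterwards, kubectl get nodes -l gpushare=true should list exactly the nodes that were labeled.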

6. Install the kubectl extension kubectl-inspect-gpushare:

cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
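kubectl treats any executable named kubectl-<name> that it can find on PATH as a plugin, so the file downloaded above is invoked as kubectl inspect gpushare. The following sketch checks that the binary can be found before calling it; plugin_on_path is a helper name invented for this post.

```shell
#!/bin/sh
# plugin_on_path is a hypothetical helper.
# $1: command name. Succeeds when the shell can resolve it as a command
# (normally an executable found on PATH).
plugin_on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Example (on the machine where kubectl runs):
#   plugin_on_path kubectl-inspect-gpushare && kubectl inspect gpushare
```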

7. Download the sample programs:

wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/1.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/2.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/3.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/4.yaml

Run whichever samples you need with kubectl create -f

PyTorch example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch
  labels:
    app: pytorch
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: pytorch
  template: 
    metadata:
      labels:
        app: pytorch
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch
        args: [/bin/sh, -c, 'while true ;do   sleep 1000;done']
        resources:
          limits:
            aliyun.com/gpu-mem: 2
        volumeMounts:
        - name: workspace
          mountPath: /workspace        # path inside the container
          readOnly: false              # mounted read-write
      volumes:
      - name: workspace
        hostPath:
          path: /var/pytorch          # corresponding path on the host

TensorFlow example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: tensorflow-gpu-jupyter
  template: 
    metadata:
      labels:
        app: tensorflow-gpu-jupyter
    spec:
      containers:
      - name: tensorflow-gpu-jupyter
        image: tensorflow/tensorflow:latest-gpu-jupyter
        resources:
          limits:
            aliyun.com/gpu-mem: 3
---
apiVersion: v1
kind: Service
metadata:
 name: tensorflow-gpu-jupyter
 labels:
  app: tensorflow-gpu-jupyter
spec:
 type: NodePort
 ports:
 - port: 8888
   targetPort: 8888
   nodePort: 30567
 selector:
  app: tensorflow-gpu-jupyter
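The two examples above request 2 and 3 units of aliyun.com/gpu-mem. Assuming the unit is GiB (the device plugin's default, as far as I know) and that both pods land on the same card, a quick way to see whether they fit is to add up the requests. The helper fits_on_gpu below is made up for this post and only does the arithmetic; it does not query the cluster.

```shell
#!/bin/sh
# fits_on_gpu is a hypothetical helper.
# $1: memory of one GPU in GiB (assumed unit).
# Remaining arguments: aliyun.com/gpu-mem requests in GiB.
# Succeeds when the sum of the requests does not exceed the GPU's memory.
fits_on_gpu() {
    capacity=$1
    shift
    total=0
    for req in "$@"; do
        total=$((total + req))
    done
    [ "$total" -le "$capacity" ]
}

# Example: the PyTorch (2) and TensorFlow (3) pods on an 8 GiB card
#   fits_on_gpu 8 2 3 && echo "both pods fit on one GPU"
```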