GPUManager Deployment and Testing

GPUManager Deployment

1. Deploy the gpu-manager-daemonset and gpu-quota-admission services

kubectl apply -f gpu-manager.yaml

The gpu-manager.yaml manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-quota-admission
  namespace: kube-system
data:
  gpu-quota-admission.config: |
    {
        "QuotaConfigMapName": "gpuquota",
        "QuotaConfigMapNamespace": "kube-system",
        "GPUModelLabel": "gaia.tencent.com/gpu-model",
        "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
    }

---
apiVersion: v1
kind: Service
metadata:
  name: gpu-quota-admission
  namespace: kube-system
spec:
  ports:
  - port: 3456
    protocol: TCP
    targetPort: 3456
  selector:
    k8s-app: gpu-quota-admission
  type: ClusterIP

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: gpu-quota-admission
  name: gpu-quota-admission
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: gpu-quota-admission
  template:
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      namespace: kube-system
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
            weight: 1
      containers:
      - env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS
          value: --incluster-mode=true
        image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
        imagePullPolicy: IfNotPresent
        name: gpu-quota-admission
        ports:
        - containerPort: 3456
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      dnsPolicy: ClusterFirstWithHostNet
      initContainers:
      - command:
        - sh
        - -c
        - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config
          /etc/kubernetes/'
        image: busybox
        imagePullPolicy: Always
        name: init-kube-config
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      priority: 2000000000
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: gpu-manager
      serviceAccountName: gpu-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: gpu-quota-admission
        name: config

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system

---
apiVersion: v1
kind: Service
metadata:
  name: gpu-manager-metric
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    name: gpu-manager-ds

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      priorityClassName: "system-node-critical"
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      initContainers:
        - image: menghe.tencentcloudcr.com/public/alpine-uvm:0.1
          imagePullPolicy: IfNotPresent
          name: nvidia-uvm-enable
          securityContext:
            capabilities:
              add:
              - ALL
          volumeMounts:
          - mountPath: /lib/modules
            name: lib-modules
            readOnly: true
          - mountPath: /dev
            name: dev
      containers:
        - image: tkestack/gpu-manager:v1.0.4
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - mountPath: /var/run/docker.sock
              name: docker 
              readOnly: true 
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: --incluster-mode=true
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: docker
          hostPath:
            type: File
            path: /var/run/docker.sock
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: lib-modules
          hostPath:
            type: Directory
            path: /lib/modules
        - name: dev
          hostPath:
            type: Directory
            path: /dev/
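
After applying the manifest, a quick sanity check that the admission components came up (object names as defined above):

kubectl -n kube-system get deployment gpu-quota-admission
kubectl -n kube-system get service gpu-quota-admission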

2. Label the GPU nodes with nvidia-device-enable=enable

kubectl label node *.*.*.* nvidia-device-enable=enable
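
To confirm the label took effect, list nodes by that label:

kubectl get nodes -l nvidia-device-enable=enable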

3. Verify that the gpu-manager-daemonset pods are correctly scheduled onto the GPU nodes

kubectl get pods -n kube-system
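
Since kube-system is usually busy, filtering for the DaemonSet pods and showing which node each landed on makes the check easier, e.g.:

kubectl get pods -n kube-system -o wide | grep gpu-manager

Each labeled GPU node should be running one gpu-manager-daemonset pod in the Running state.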

4. Set the kube-scheduler static pod's dnsPolicy to ClusterFirstWithHostNet
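
On a kubeadm-based cluster this means editing the static pod manifest, typically /etc/kubernetes/manifests/kube-scheduler.yaml (the path may differ in your distribution); the kubelet restarts the pod automatically when the file changes. The relevant excerpt:

spec:
  dnsPolicy: ClusterFirstWithHostNet  # lets the scheduler resolve gpu-quota-admission.kube-system via cluster DNS
  hostNetwork: true                   # kubeadm runs kube-scheduler with host networking, hence this dnsPolicy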

5. Enable the policy-config option on kube-scheduler by adding the following startup flags (a sketch of the full static pod changes follows the policy file below):

--policy-config-file=/etc/kubernetes/scheduler-policy-config.json
--use-legacy-policy-config=true


Contents of the /etc/kubernetes/scheduler-policy-config.json configuration file:

{
   "apiVersion" : "v1",
   "extenders" : [
      {
         "apiVersion" : "v1beta1",
         "enableHttps" : false,
         "filterVerb" : "predicates",
         "managedResources" : [
            {
               "ignoredByScheduler" : false,
               "name" : "tencent.com/vcuda-core"
            }
         ],
         "nodeCacheCapable" : false,
         "urlPrefix" : "http://gpu-quota-admission.kube-system:3456/scheduler"
      }
   ],
   "kind" : "Policy"
}
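
For the scheduler to read this file, the flags from step 5 must be added to the kube-scheduler command line and the file must be mounted into the static pod. A sketch of the manifest changes, assuming a kubeadm-style /etc/kubernetes/manifests/kube-scheduler.yaml (the volume name is illustrative); note that --policy-config-file and --use-legacy-policy-config exist only in older releases, since the scheduler Policy API was removed around Kubernetes v1.23:

spec:
  containers:
  - command:
    - kube-scheduler
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
    - --use-legacy-policy-config=true
    # ...keep the existing flags...
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler-policy-config.json
      name: scheduler-policy-config
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: File
    name: scheduler-policy-config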

Solution Testing

The test uses the TensorFlow framework; the test image ships with MNIST, CIFAR-10, and AlexNet benchmark datasets, so you can pick whichever test plan fits your needs.

Test steps:

  1. Use the TensorFlow framework with the MNIST dataset for verification. The TensorFlow test image and deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: menghe.tencentcloudcr.com/public/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "32"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "32"
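
For reference, in GPUManager's resource model tencent.com/vcuda-core is counted in hundredths of a physical card (100 = one full GPU, so 50 here is half a card), and one tencent.com/vcuda-memory unit is 256MiB, so the 32 units requested correspond to roughly 8GiB of GPU memory. Assuming the manifest above is saved as vcuda-test.yaml (the file name is illustrative):

kubectl apply -f vcuda-test.yaml
kubectl get pods -l k8s-app=vcuda-test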

  2. Enter the test container (it runs in the default namespace; if you changed the test YAML, specify the namespace as needed):

kubectl exec -it `kubectl get pods -o name | cut -d '/' -f2` -- bash
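
Inside the container, nvidia-smi should report only the memory slice granted to the pod (about 8GiB with the requests above) rather than the whole card, assuming the vcuda hook intercepts NVML as intended; this is a quick way to confirm isolation is active:

nvidia-smi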

3. Run the test commands, choosing a training framework/dataset as needed:

 a. MNIST

cd /data/tensorflow/mnist && time python convolutional.py

 b. AlexNet

cd /data/tensorflow/alexnet && time python alexnet_benchmark.py

 c. CIFAR-10

cd /data/tensorflow/cifar10 && time python cifar10_train.py
4. On the physical host, check GPU usage with the nvidia-smi pmon -s u -d 1 command.