OpenPAI: Distributed TensorFlow Training on k8s with FrameworkController

Tilyp

Article link: https://blog.csdn.net/Tilyp/article/details/103736721

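Overview:

After FrameworkLauncher, its YARN-based job scheduling tool, OpenPAI added FrameworkController, a job scheduling tool based on k8s. It feels similar to Kubeflow's TFJob. Let's first try using FrameworkController on its own.

Environment:

k8s: 1.15.1

docker: 18.09.5

The requirements on the operating system are not strict: Ubuntu and CentOS both work; other systems have not been verified.

Installing FrameworkController:

Create a ServiceAccount and a ClusterRoleBinding for FrameworkController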

kubectl create serviceaccount frameworkcontroller --namespace default
kubectl create clusterrolebinding frameworkcontroller \
  --clusterrole=cluster-admin \
  --user=system:serviceaccount:default:frameworkcontroller

Write the frameworkcontroller.yaml file

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the service account with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        #env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
        #- name: KUBECONFIG
        #  value: {Pod Local KubeConfig File Path}

Create the FrameworkController

kubectl create -f frameworkcontroller.yaml

Check the result

kubectl get pod -n default

NAME                    READY   STATUS    RESTARTS   AGE
frameworkcontroller-0   1/1     Running   0          30s

FrameworkBarrier test:

Create a ServiceAccount, a ClusterRole, and a ClusterRoleBinding for FrameworkBarrier

kubectl create serviceaccount frameworkbarrier --namespace default
kubectl create clusterrole frameworkbarrier \
            --verb=get,list,watch \
            --resource=frameworks
kubectl create clusterrolebinding frameworkbarrier \
     --clusterrole=frameworkbarrier \
     --user=system:serviceaccount:default:frameworkbarrier

Write the frameworkbarrier.yaml file

apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: frameworkbarrier
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 0
  taskRoles:
  - name: server
    taskNumber: 10
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            # Using /mnt/frameworkbarrier/injector.sh to inject environment variables,
            # such as:
            # FB_{UpperCase({TaskRoleName})}_IPS=
            #   {Task[0].PodIP},...,
            #   {Task[TaskRole.TaskNumber-1].PodIP}
            # FB_{UpperCase({TaskRoleName})}_ADDRESSES=
            #   {Task[0].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT},...,
            #   {Task[TaskRole.TaskNumber-1].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT}
            #  Note, the environment variable FB_{UpperCase({TaskRoleName})}_PORT should be
            #  provided by the caller in advance.
            #
            # User may need to tweak these environment variables to its own
            # input format.
            #
            # User can also write its own injector script to inject other
            # Framework information from the Framework object file:
            # /mnt/frameworkbarrier/framework.json.
            command: [
            "sh", "-c",
            "FB_SERVER_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh && printenv &&
            FB_SERVER_PORT=4002 FB_WORKER_PORT=5002 . /mnt/frameworkbarrier/injector.sh && printenv &&
            sleep 60"]
            ports:
            - containerPort: 4001
            - containerPort: 4002
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          # [PREREQUISITE]
          # User needs to create a service account in the same namespace of this
          # Framework with granted permission for frameworkbarrier, if the k8s
          # cluster enforces authorization.
          # For example, if the cluster enforces RBAC:
          #   kubectl create serviceaccount frameworkbarrier --namespace default
          #   kubectl create clusterrole frameworkbarrier \
          #     --verb=get,list,watch \
          #     --resource=frameworks
          #   kubectl create clusterrolebinding frameworkbarrier \
          #     --clusterrole=frameworkbarrier \
          #     --user=system:serviceaccount:default:frameworkbarrier
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            # Using official image to demonstrate this example.
            image: frameworkcontroller/frameworkbarrier
            # Using k8s inClusterConfig, so usually, no need to specify
            # KUBE_APISERVER_ADDRESS or KUBECONFIG
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
  - name: worker
    taskNumber: 10
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            command: [
            "sh", "-c",
            "FB_SERVER_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh && printenv &&
            FB_SERVER_PORT=4002 FB_WORKER_PORT=5002 . /mnt/frameworkbarrier/injector.sh && printenv &&
            sleep 60"]
            ports:
            - containerPort: 5001
            - containerPort: 5002
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          # [PREREQUISITE]
          # Same as server TaskRole.
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            image: frameworkcontroller/frameworkbarrier
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
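
The YAML comments above describe which variables injector.sh exports. As a rough illustration of that naming scheme only (this is not the real injector.sh), the following Python sketch composes the same strings. The function name `barrier_env` and the sample IPs are hypothetical, and skipping the `_ADDRESSES` variable when no port is supplied is an assumption of this sketch.

```python
def barrier_env(role_ips, role_ports):
    """Compose FB_{ROLE}_IPS / FB_{ROLE}_ADDRESSES strings as described in the YAML comments.

    role_ips:   {"server": ["10.0.0.1", ...]}  pod IPs per task role, in task-index order
    role_ports: {"server": 4001}               the caller-supplied FB_{ROLE}_PORT values
    """
    env = {}
    for role, ips in role_ips.items():
        key = role.upper()  # UpperCase({TaskRoleName})
        env[f"FB_{key}_IPS"] = ",".join(ips)
        port = role_ports.get(role)
        if port is not None:  # assumption: no port given, no ADDRESSES variable
            env[f"FB_{key}_ADDRESSES"] = ",".join(f"{ip}:{port}" for ip in ips)
    return env


if __name__ == "__main__":
    # Hypothetical pod IPs; in the cluster they come from the Framework object.
    env = barrier_env(
        {"server": ["10.0.0.1", "10.0.0.2"], "worker": ["10.0.0.3"]},
        {"server": 4001, "worker": 5001},
    )
    for name, value in sorted(env.items()):
        print(f"{name}={value}")
```

This also shows why the example command sources injector.sh twice: each call uses different FB_SERVER_PORT and FB_WORKER_PORT values, so the address lists are rebuilt with the new ports.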

Deploy the FrameworkBarrier example

kubectl apply -f frameworkbarrier.yaml -n default

Verify

kubectl get pod -n default


NAME                        READY   STATUS     RESTARTS   AGE
frameworkbarrier-server-0   0/1     Init:0/1   0          8s
frameworkbarrier-server-1   0/1     Init:0/1   0          8s
frameworkbarrier-server-2   0/1     Init:0/1   0          8s
frameworkbarrier-server-3   0/1     Init:0/1   0          8s
frameworkbarrier-server-4   0/1     Init:0/1   0          8s
frameworkbarrier-server-5   0/1     Init:0/1   0          8s
frameworkbarrier-server-6   0/1     Init:0/1   0          8s
frameworkbarrier-server-7   0/1     Init:0/1   0          8s
frameworkbarrier-server-8   0/1     Init:0/1   0          8s
frameworkbarrier-server-9   0/1     Init:0/1   0          8s
frameworkbarrier-worker-0   0/1     Init:0/1   0          8s
frameworkbarrier-worker-1   0/1     Init:0/1   0          7s
frameworkbarrier-worker-2   0/1     Init:0/1   0          7s
frameworkbarrier-worker-3   0/1     Init:0/1   0          7s
frameworkbarrier-worker-4   0/1     Init:0/1   0          7s
frameworkbarrier-worker-5   0/1     Init:0/1   0          7s
frameworkbarrier-worker-6   0/1     Init:0/1   0          6s
frameworkbarrier-worker-7   0/1     Init:0/1   0          6s
frameworkbarrier-worker-8   0/1     Init:0/1   0          6s
frameworkbarrier-worker-9   0/1     Init:0/1   0          6s
frameworkcontroller-0       1/1     Running    0          1m55s

List the frameworks

kubectl get framework -n default

NAME                                   AGE
frameworkbarrier                       69s

Delete the frameworkbarrier framework

kubectl delete framework frameworkbarrier -n default

Running a distributed TensorFlow job

A distributed job needs a distributed file system. I use NFS here, which is simple to set up. First create the PV and PVC.

data-volume.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-volume
  labels:
    pv: data-volume
spec:
  capacity:
    storage: 20Gi
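  accessModes:
    # ReadWriteOnce is kept from the original setup; NFS also supports
    # ReadWriteMany, which fits a volume shared by pods on several nodes.
    - ReadWriteOnce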
  nfs:
    server: 192.168.0.10
    path: /data3/nfs-data/pai/

data-volume-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-volume
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      pv: data-volume

Create the PV and PVC

kubectl apply -f data-volume.yaml 
kubectl apply -f data-volume-pvc.yaml

Check the result

kubectl get pv,pvc -n default

NAME                           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS   REASON   AGE
persistentvolume/data-volume   20Gi       RWO            Retain           Bound    default/data-volume                           89s

NAME                                STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/data-volume   Bound    data-volume   20Gi       RWO                           89s

Write tensorflowdistributedtrainingwithcpu.yaml. I use the CPU version here.

apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: tensorflowdistributedtrainingwithcpu
  namespace: default
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 2
  taskRoles:
  - name: ps
    taskNumber: 2
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          # [PREREQUISITE]
          # User needs to set up the k8s cluster networking model and be aware of
          # the potential network overhead, if they want to disable hostNetwork to
          # avoid coordinating containerPort usage.
          # For this example, if hostNetwork is disabled, at least 1 node is
          # enough; otherwise, at least 3 nodes are needed, since all 3 workers
          # are specified with the same containerPort.
          # See https://kubernetes.io/docs/concepts/cluster-administration/networking
          hostNetwork: false
          containers:
          - name: tensorflow
            # Using official image to demonstrate this example.
            # The image contains and only contains tensorflow official code.
            image: frameworkcontroller/tensorflow-examples:cpu
            # For the tf_cnn_benchmarks usage, see
            # https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
            workingDir: /tensorflow/benchmarks/scripts/tf_cnn_benchmarks
            # Using /mnt/frameworkbarrier/injector.sh to inject environment variables
            # without the need for image invasion and k8s DNS:
            # FB_{UpperCase({TaskRoleName})}_ADDRESSES=
            #   {Task[0].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT},...,
            #   {Task[TaskRole.TaskNumber-1].PodIP}:${FB_{UpperCase({TaskRoleName})}_PORT}
            # See more in ./example/framework/extension/frameworkbarrier.yaml
            command: [
            "sh", "-c",
            "FB_PS_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh &&
            python tf_cnn_benchmarks.py --job_name=ps --task_index=${FC_TASK_INDEX}
            --ps_hosts=${FB_PS_ADDRESSES} --worker_hosts=${FB_WORKER_ADDRESSES}
            --variable_update=parameter_server --cross_replica_sync=false
            --model=alexnet --batch_size=8 --num_batches=10
            --device=cpu --local_parameter_device=cpu --data_format=NHWC
            --data_name=cifar10 --data_dir=/mnt/data/cifar-10-batches-py
            --train_dir=/mnt/data/${FC_FRAMEWORK_NAME}/output"]
            ports:
            - containerPort: 4001
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
            - name: data-volume
              mountPath: /mnt/data
          # [PREREQUISITE]
          # User needs to create a service account for frameworkbarrier, if the
          # k8s cluster enforces authorization.
          # See more in ./example/framework/extension/frameworkbarrier.yaml
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            # Using official image to demonstrate this example.
            image: frameworkcontroller/frameworkbarrier
            # Using k8s inClusterConfig, so usually, no need to specify
            # KUBE_APISERVER_ADDRESS or KUBECONFIG
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
          - name: data-volume
            persistentVolumeClaim:           # pvc
              claimName: data-volume
            # [PREREQUISITE]
            # User needs to specify his own data-volume for input data and
            # output model.
            # The data-volume must be a distributed shared file system, so that
            # data can be "handed off" between Pods, such as nfs, cephfs or
            # glusterfs, etc.
            # See https://kubernetes.io/docs/concepts/storage/volumes.
            #
            # And then he needs to download and extract the example input data
            # from:
            #   https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
            # to:
            #   {Volume Shared Directory}/cifar-10-batches-py
            #
            # For example:
            #nfs:
            #  server: {NFS Server Host}
            #  path: {NFS Shared Directory}
  - name: worker
    taskNumber: 3
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      # Succeed the FrameworkAttempt immediately if all of the worker's Tasks succeeded.
      minSucceededTaskCount: 3
    task:
      retryPolicy:
        fancyRetryPolicy: false
        maxRetryCount: 0
      pod:
        spec:
          restartPolicy: Never
          # [PREREQUISITE]
          # Same as ps TaskRole.
          hostNetwork: false
          containers:
          - name: tensorflow
            image: frameworkcontroller/tensorflow-examples:cpu
            workingDir: /tensorflow/benchmarks/scripts/tf_cnn_benchmarks
            command: [
            "sh", "-c",
            "FB_PS_PORT=4001 FB_WORKER_PORT=5001 . /mnt/frameworkbarrier/injector.sh &&
            python tf_cnn_benchmarks.py --job_name=worker --task_index=${FC_TASK_INDEX}
            --ps_hosts=${FB_PS_ADDRESSES} --worker_hosts=${FB_WORKER_ADDRESSES}
            --variable_update=parameter_server --cross_replica_sync=false
            --model=alexnet --batch_size=8 --num_batches=10
            --device=cpu --local_parameter_device=cpu --data_format=NHWC
            --data_name=cifar10 --data_dir=/mnt/data/cifar-10-batches-py
            --train_dir=/mnt/data/${FC_FRAMEWORK_NAME}/output"]
            ports:
            - containerPort: 5001
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
            - name: data-volume
              mountPath: /mnt/data
          # [PREREQUISITE]
          # Same as ps TaskRole.
          serviceAccountName: frameworkbarrier
          initContainers:
          - name: frameworkbarrier
            image: frameworkcontroller/frameworkbarrier
            #env:
            #- name: KUBE_APISERVER_ADDRESS
            #  value: {http[s]://host:port}
            #- name: KUBECONFIG
            #  value: {Pod Local KubeConfig File Path}
            volumeMounts:
            - name: frameworkbarrier-volume
              mountPath: /mnt/frameworkbarrier
          volumes:
          - name: frameworkbarrier-volume
            emptyDir: {}
          - name: data-volume
            persistentVolumeClaim:           # pvc
              claimName: data-volume
            # [PREREQUISITE]
            # Same as ps TaskRole.
            #nfs:
            #  server: {NFS Server Host}
            #  path: {NFS Shared Directory}
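
Both task roles pass the same `--ps_hosts` and `--worker_hosts` strings and differ only in `--job_name` and `--task_index`. The sketch below shows how such comma-separated address strings map to a cluster description and how one task finds its own address. The helper names are hypothetical. The resulting dict has the shape that TensorFlow 1.x `tf.train.ClusterSpec` accepts, but whether tf_cnn_benchmarks builds its cluster exactly this way is not verified here.

```python
def parse_hosts(hosts):
    """Split a comma-separated "ip:port,ip:port" string into a list, ignoring empty items."""
    return [h.strip() for h in hosts.split(",") if h.strip()]


def cluster_spec(ps_hosts, worker_hosts):
    """Build the {"ps": [...], "worker": [...]} mapping that describes the cluster."""
    return {"ps": parse_hosts(ps_hosts), "worker": parse_hosts(worker_hosts)}


def task_address(spec, job_name, task_index):
    """Return the address that task (job_name, task_index) is expected to listen on."""
    return spec[job_name][task_index]


if __name__ == "__main__":
    # Hypothetical values for FB_PS_ADDRESSES and FB_WORKER_ADDRESSES.
    spec = cluster_spec(
        "10.0.0.1:4001,10.0.0.2:4001",
        "10.0.0.3:5001,10.0.0.4:5001,10.0.0.5:5001",
    )
    print(spec)
    print(task_address(spec, "worker", 2))
```

Because every pod receives identical address lists in task-index order, `${FC_TASK_INDEX}` is enough for a pod to identify its own entry.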

Before creating the job, download the data into the exported NFS path

cd /data3/nfs-data/pai
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -zxvf cifar-10-python.tar.gz
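
Before submitting the job, it can help to confirm that the archive extracted into the directory that `--data_dir` points at. The following is a minimal sketch, assuming the standard layout of the CIFAR-10 python archive (a `cifar-10-batches-py` directory containing five training batches, a test batch, and `batches.meta`). The function name is hypothetical.

```python
import os

# Files shipped in the CIFAR-10 python archive (readme.html is ignored here).
EXPECTED_FILES = [f"data_batch_{i}" for i in range(1, 6)] + ["test_batch", "batches.meta"]


def missing_cifar10_files(data_dir):
    """Return the expected CIFAR-10 files that are absent from data_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    # NFS export path from the PV definition plus the archive's top-level directory.
    missing = missing_cifar10_files("/data3/nfs-data/pai/cifar-10-batches-py")
    print("missing files:", missing or "none")
```

Inside the pods the same directory appears as /mnt/data/cifar-10-batches-py, because the PVC is mounted at /mnt/data.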

Create the job

kubectl apply -f tensorflowdistributedtrainingwithcpu.yaml -n default

Check the pods

kubectl get pod -n default

NAME                                            READY   STATUS     RESTARTS   AGE
frameworkcontroller-0                           1/1     Running    0          8h
tensorflowdistributedtrainingwithcpu-ps-0       0/1     Init:0/1   0          4s
tensorflowdistributedtrainingwithcpu-ps-1       0/1     Init:0/1   0          4s
tensorflowdistributedtrainingwithcpu-worker-0   0/1     Init:0/1   0          4s
tensorflowdistributedtrainingwithcpu-worker-1   0/1     Init:0/1   0          4s
tensorflowdistributedtrainingwithcpu-worker-2   0/1     Init:0/1   0          4s
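
When many pods belong to one job, a per-status count is easier to read than the full table. The sketch below parses the default text output of `kubectl get pod` and counts the statuses of the pods that belong to one framework. It relies on the `<framework name>-<task role>-<index>` pod naming visible in the output above; the function names and the shortened sample table are illustrative only.

```python
from collections import Counter


def pod_statuses(table):
    """Map pod name -> STATUS from the default text output of `kubectl get pod`."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        cells = line.split()
        statuses[cells[name_col]] = cells[status_col]
    return statuses


def summarize(statuses, framework_name):
    """Count statuses of the pods whose name starts with '<framework_name>-'."""
    prefix = framework_name + "-"
    return Counter(s for name, s in statuses.items() if name.startswith(prefix))


SAMPLE = """\
NAME                                            READY   STATUS     RESTARTS   AGE
frameworkcontroller-0                           1/1     Running    0          8h
tensorflowdistributedtrainingwithcpu-ps-0       0/1     Init:0/1   0          4s
tensorflowdistributedtrainingwithcpu-worker-0   0/1     Init:0/1   0          4s
"""

if __name__ == "__main__":
    print(summarize(pod_statuses(SAMPLE), "tensorflowdistributedtrainingwithcpu"))
```

The `Init:0/1` status means the frameworkbarrier init container is still waiting, so the tensorflow container has not started yet.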

Check the framework

kubectl get framework -n default

NAME                                   AGE
tensorflowdistributedtrainingwithcpu   35s

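After a few minutes, once the pods have finished successfully, they all exit. The training results appear under the NFS-mounted path, in a directory named after the job. This concludes the deployment and use of FrameworkController.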
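
When the job counts as finished is governed by the `frameworkAttemptCompletionPolicy` fields in the YAML. The sketch below is my simplified reading of those fields, not FrameworkController's actual code: a role with `minFailedTaskCount >= 1` fails the attempt once that many of its tasks have failed, a role with `minSucceededTaskCount >= 1` succeeds the attempt once that many have succeeded, and `-1` disables the check. Checking failure before success is an assumption of this sketch, and the dict keys are hypothetical.

```python
def attempt_outcome(task_roles):
    """Decide a FrameworkAttempt outcome from per-role task counts.

    Each role is a dict with keys: total, succeeded, failed,
    min_failed (minFailedTaskCount) and min_succeeded (minSucceededTaskCount).
    Returns "Failed", "Succeeded", or None while the attempt is still running.
    """
    for role in task_roles:
        if role["min_failed"] >= 1 and role["failed"] >= role["min_failed"]:
            return "Failed"
    for role in task_roles:
        if role["min_succeeded"] >= 1 and role["succeeded"] >= role["min_succeeded"]:
            return "Succeeded"
    # No threshold reached: the attempt ends only when every task has completed.
    if all(r["succeeded"] + r["failed"] == r["total"] for r in task_roles):
        return "Succeeded"
    return None


if __name__ == "__main__":
    ps = {"total": 2, "succeeded": 0, "failed": 0, "min_failed": 1, "min_succeeded": -1}
    worker = {"total": 3, "succeeded": 3, "failed": 0, "min_failed": 1, "min_succeeded": 3}
    print(attempt_outcome([ps, worker]))
```

With the values from tensorflowdistributedtrainingwithcpu.yaml, the attempt succeeds as soon as all 3 workers succeed, even though the ps tasks, which typically do not exit on their own in parameter-server training, are still running.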

If you have questions, join the QQ group: 526855734