TensorFlow single-node training and model serving on Kubernetes

xingyuzhe

Category: Kubernetes. Tags: tensorflow, mnist, kubernetes

Article link: https://blog.csdn.net/xingyuzhe/article/details/83863823

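Goal: train the TensorFlow MNIST model and deploy it as a service using Kubernetes.

Environment: Kubernetes 1.11 cluster / GPU: NVIDIA P100

Steps: overview of the MNIST example -> model training -> model serving -> service test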

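1. Overview of the MNIST deep learning example

MNIST - the classic introductory example for deep learning

-- A subset of the NIST dataset

-- Contains 60,000 images as training data and 10,000 images as test data

-- Each image represents one digit from 0 to 9

-- Each image is 28*28 pixels

-- The digit is centered in the image

Dataset address

--http://yann.lecun.com/exdb/mnist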
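
The files at the address above are distributed in the IDX binary format: a big-endian header followed by raw unsigned bytes. As a minimal sketch (the function name `parse_idx_images` is illustrative and not part of the referenced example code), an uncompressed image file can be parsed as follows, which also makes the image count and the 28*28 size visible in the header:

```python
import struct


def parse_idx_images(buf: bytes):
    """Parse an uncompressed IDX image buffer into (rows, cols, list of flat images)."""
    # Header: four big-endian unsigned 32-bit integers (magic, count, rows, cols).
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 2051:  # 2051 is the magic number of MNIST image files
        raise ValueError(f"unexpected magic number {magic}, expected 2051")
    size = rows * cols
    if len(buf) < 16 + count * size:
        raise ValueError("buffer is shorter than the header claims")
    # Each image is rows*cols unsigned bytes stored row by row.
    images = [buf[16 + i * size: 16 + (i + 1) * size] for i in range(count)]
    return rows, cols, images
```

For the real dataset the buffer would typically come from something like `gzip.open("train-images-idx3-ubyte.gz", "rb").read()`, assuming the file has already been downloaded to the working directory.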

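Reference code: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/example

mnist_input_data.py: code that reads the input data

mnist_saved_model.py: code that trains and saves the model

mnist_client.py: client code for testing the model service

The parts of the code that need to be changed are the environment-specific settings, such as the working directory and the model output directory.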

2. Model training

Training is run as a Kubernetes Job:

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    job: caoxin-train-1
  name: caoxin-train-1
  namespace: default
spec:
  template:
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - cd /data && python mnist_saved_model.py --training_iteration=10000 --model_version=2 /data/mnist
        image: tensorflow/tensorflow:1.12.0-gpu-py3
        name: caoxin-train-1
        ports:
        - containerPort: 2222
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /data
          name: data
      restartPolicy: OnFailure
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: caoxin-dev

The persistent storage here is the cluster's CephFS, mounted through the caoxin-dev PersistentVolumeClaim.
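
Before moving on to serving, it can help to confirm that the Job actually wrote a model to the shared volume. The sketch below assumes the training script exports to a numeric version subdirectory under the export path (for example /data/mnist/2 for --model_version=2, as the upstream example does); the helper names are illustrative. TensorFlow Serving treats numeric subdirectories as versions and by default serves the highest one.

```python
import os


def list_model_versions(model_base_path: str):
    """Return the numeric version subdirectories under model_base_path, ascending."""
    versions = []
    for name in os.listdir(model_base_path):
        full = os.path.join(model_base_path, name)
        # Only directories with purely numeric names are counted as versions.
        if name.isdigit() and os.path.isdir(full):
            versions.append(int(name))
    return sorted(versions)


def has_saved_model(model_base_path: str, version: int) -> bool:
    """Check that the version directory contains a SavedModel protobuf file."""
    version_dir = os.path.join(model_base_path, str(version))
    return any(
        os.path.isfile(os.path.join(version_dir, fname))
        for fname in ("saved_model.pb", "saved_model.pbtxt")
    )
```

Run from a pod that mounts the same volume, `list_model_versions("/data/mnist")` should include 2 once the Job above has completed, assuming the export layout described here.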

3. Model serving

The service is deployed from the tensorflow/serving image:

apiVersion: apps/v1  # available on Kubernetes 1.11; extensions/v1beta1 Deployments are no longer served from 1.16
kind: Deployment
metadata:
  labels:
    serve: caoxin-serve
  name: caoxin-serve
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      serve: caoxin-serve
  template:
    metadata:
      labels:
        serve: caoxin-serve
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/data/mnist
        image: tensorflow/serving:1.12.0-gpu
        name: caoxin-serve
        ports:
        - containerPort: 9000
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /data
          name: data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: caoxin-dev
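
A Deployment is rejected by the API server when spec.selector does not match the labels of its pod template, so it is worth checking the manifest before applying it. The following is a minimal sketch of that check on a manifest already loaded into a Python dict (for example with a YAML parser); the function name is illustrative and only matchLabels is handled, not matchExpressions.

```python
def selector_matches_template(deployment: dict) -> bool:
    """Return True if every selector.matchLabels entry appears in the pod template labels."""
    spec = deployment.get("spec", {})
    match_labels = spec.get("selector", {}).get("matchLabels", {})
    template_labels = spec.get("template", {}).get("metadata", {}).get("labels", {})
    # An empty selector is treated as invalid in this sketch.
    return bool(match_labels) and all(
        template_labels.get(key) == value for key, value in match_labels.items()
    )
```

The same label (serve: caoxin-serve) is also what a Service selector has to match in order to route traffic to the serving pods.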

Deploy the Service:

apiVersion: v1
kind: Service
metadata:
  labels:
    serve: caoxin-serve
  name: caoxin-serve
  namespace: default
spec:
  ports:
  - nodePort: 30053
    port: 80
    protocol: TCP
    targetPort: 9000
  selector:
    serve: caoxin-serve
  type: NodePort
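
Once the Service exists, the model server should be reachable on port 30053 of any cluster node. As a rough first check before running a real prediction client, the sketch below only tests whether a TCP connection can be opened; the function name and the default timeout are illustrative choices, and a successful connection does not prove that the model itself has loaded.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_is_open("<node-ip>", 30053)` with the address of one of the cluster nodes substituted in. If it returns True, the mnist_client.py script from the reference code can then be pointed at the same address, assuming it takes the server as a host:port argument as the upstream example does.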

4. Service test

Training log: