Kubeflow-NVIDIA-resnet reproduction

Resnet-cifar-10-gpu

nvidia-resnet

1. Building the images

Reference: components

1.1 Webapp

IMAGE=<webapp-image>
git clone https://github.com/NVIDIA/tensorrt-inference-server.git
base=tensorrt-inference-server
docker build -t base-trtis-client -f $base/Dockerfile.client $base
rm -rf $base

# Build & push webapp image
docker build -t $IMAGE .
docker push $IMAGE

Result:

1.2 Webapp_launcher

IMAGE=<inference-server-launcher-image>

docker build -t $IMAGE .

docker push $IMAGE

2. Step 4/9: dataset download failure

  1. preprocess.py fails to download the dataset

Error shown: no shared file

Cause: the source code has no interface for reading the dataset from the mounted volume path; it tries to download the dataset automatically instead.

Fix: copy the dataset into the image, under /root/.keras/datasets (the Keras dataset cache directory).
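A minimal sketch of baking the dataset into the image (assuming preprocess.py loads CIFAR-10 via keras.datasets.cifar10, which caches the archive as cifar-10-batches-py.tar.gz under /root/.keras/datasets/; the container ID is the one used in the test below):

# sketch only: copy the archive into the running container's Keras cache, then commit it
docker exec 40915777bc9a mkdir -p /root/.keras/datasets
docker cp cifar-10-python.tar.gz 40915777bc9a:/root/.keras/datasets/cifar-10-batches-py.tar.gz
docker commit 40915777bc9a 10.18.127.1:5000/preprocess:v0529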

Test:

root@40915777bc9a:/mnt/workspace# python preprocess.py --input_dir=/mnt/workspace/raw_data/ --output_dir=/mnt/workspace/processed_data/
Using TensorFlow backend.
input_dir: /mnt/workspace/raw_data/
output_dir: /mnt/workspace/processed_data/
root@40915777bc9a:/mnt/workspace# ls
cifar-10-python.tar.gz  preprocess.py  process-test.py  processed_data  raw_data  saved_model
root@40915777bc9a:/mnt/workspace# cd processed_data/
root@40915777bc9a:/mnt/workspace/processed_data# ls
x_test.npy  x_train.npy  y_test.npy  y_train.npy

Result:

[root@comput3 preprocess]# docker commit 40915777bc9a 10.18.127.1:5000/preprocess:v0529

The new image is 10.18.127.1:5000/preprocess:v0529.

3. Adding a node label and GPU resources

preprocess = (PreprocessOp('preprocess', raw_data_dir, processed_data_dir)
              .add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3')
              .add_resource_limit("nvidia.com/gpu", "1")
              .add_resource_request("nvidia.com/gpu", "1"))

4. GPU scheduling error

Error:

Warning FailedScheduling 40s (x2 over 40s) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable
kubectl describe nodes

kubectl describe nodes showed no GPU resources on the node. Troubleshooting pointed to the Docker setup, so nvidia-docker2 was reinstalled.
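What to look for in that output (a sketch; once the NVIDIA device plugin is working, the GPU node lists nvidia.com/gpu under both Capacity and Allocatable):

kubectl describe node 10.18.127.3 | grep -A 8 Capacity
#   nvidia.com/gpu:  1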

4.1 Installing nvidia-docker2

Uninstalled nvidia-docker 1.0 and installed version 2.0.

Make sure the NVIDIA driver and a supported Docker version are already installed.

CentOS distributions

Install the repository for your distribution: https://nvidia.github.io/nvidia-docker/
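A sketch of the repository setup on CentOS, following the page above (check that page for the current commands):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo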

[root@comput3 ~]# yum install nvidia-docker2

Error:

The currently installed Docker 18.09.2 is incompatible with nvidia-docker2-2.0.3-3.docker18.09.6.ce.noarch (which is built against Docker 18.09.6).

Summary: reinstall Docker 18.09.6.

[root@comput3 ~]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

[root@comput3 ~]# yum list docker-ce --showduplicates | sort -r

4.2 Reinstalling Docker

[root@comput3 ~]# yum install docker-ce-18.09.6 docker-ce-cli-18.09.6 containerd.io

[root@comput3 ~]# systemctl enable docker.service

Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /etc/systemd/system/docker.service.

[root@comput3 ~]# systemctl start docker

Check that Docker is working:

[root@comput3 ~]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete 
Digest: sha256:0e11c388b664df8a27a901dce21eb89f11d8292f7fca1b3e3c4321bf7897bffe
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
[root@comput3 ~]# yum install -y nvidia-docker2

[root@comput3 docker]# pkill -SIGHUP dockerd

4.3 Configuring nvidia-docker

Reference: default-runtime

Check which Docker configuration the system is actually using:

[root@comput3 system]# systemctl status docker.service
--add-runtime nvidia=/usr/bin/nvidia-container-runtime --default-runtime nvidia --insecure-registry=10.18.127.1:5000

 

Configuration file: /etc/systemd/system/docker.service; make the change in docker.service there.

Note: the default nvidia runtime configuration file /etc/docker/daemon.json is not read by the system in this setup,

so move it aside: mv /etc/docker/daemon.json /etc/docker/daemon-src.json
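For reference, a sketch of the relevant part of /etc/systemd/system/docker.service after the change (the runtime flags are the ones shown by systemctl status above; the rest of the unit file is unchanged):

[Service]
ExecStart=/usr/bin/dockerd \
    --add-runtime nvidia=/usr/bin/nvidia-container-runtime \
    --default-runtime nvidia \
    --insecure-registry=10.18.127.1:5000

After editing, reload and restart Docker: systemctl daemon-reload && systemctl restart docker.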

 

 

After the Docker change, all the pods on the node went down, so the machines in the cluster were restarted.

On the nodes: systemctl start docker

On the master: kubectl get pod --all-namespaces

All pods show status Running.

4.4 Checking GPU status

[root@comput1 nxt]# kubectl describe nodes

GPU test on node 10.18.127.3:

[root@comput3 ~]# nvidia-docker run -it --name tf-gpu-test nvcr.io/nvidia/tensorflow:19.03-py3

root@60440df36d9d:/workspace# python
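Inside that Python shell, a quick sanity check that TensorFlow sees the GPU (a sketch, assuming the TF 1.x API shipped in the 19.03 container):

>>> import tensorflow as tf
>>> tf.test.is_gpu_available()    # should return True and log the detected GPU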

4.5 Pods in CrashLoopBackOff holding GPUs

  Available GPUs were shown as 0.

Delete all Pending and CrashLoopBackOff pods.
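A sketch of finding and deleting the offending pods (namespace and pod names will vary):

kubectl get pods --all-namespaces | grep -E 'Pending|CrashLoopBackOff'
kubectl delete pod <pod-name> -n <namespace>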

Test:

[root@comput1 nxt]# vim gpu-1.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-11
spec:
  restartPolicy: Never
  containers:
  - image: nvidia/cuda:9.0-devel       # the image tag must be specified explicitly here
    name: cuda
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
[root@comput1 nxt]# kubectl apply -f gpu-1.yaml 
[root@comput1 nxt]# kubectl get pod
NAME         READY   STATUS      RESTARTS   AGE
gpu-pod-11   0/1     Completed   0          2m40s
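To confirm the container actually saw the GPU, its output can be inspected (a follow-up suggestion, not part of the original run):

kubectl logs gpu-pod-11        # should print the nvidia-smi table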

Summary:

At this point the GPU driver, Docker, nvidia-docker2, and nvidia-container-runtime are all installed,

and GPU resources can now be scheduled through Kubernetes.

5. resnet-case

  Environment: Jupyter Notebook.

Goal: an image classification problem.

Stages: dataset preprocessing, model training, validation of model accuracy, and a webapp for serving the model (the webapp part is not finished here; see section 1).

Code:

import kfp
import kfp.dsl as dsl
import datetime
import os
import kfp.notebook
import kfp.gcp as gcp
client = kfp.Client()
from kubernetes import client as k8s_client
EXPERIMENT_NAME = 'resnet-train-imagesV0529'
exp = client.create_experiment(name=EXPERIMENT_NAME)
# Modify image='<image>' in each op to match IMAGE in the build.sh of its corresponding component

def PreprocessOp(name, input_dir, output_dir):
    return dsl.ContainerOp(
        name=name,
        #image='<preprocess-image>',
        image='10.18.127.1:5000/preprocess:v0529',
        command = ['python', 'preprocess.py'],
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
        ],
        file_outputs={'output': '/output.txt'}
    )


def TrainOp(name, input_dir, output_dir, model_name, model_version, epochs):
    return dsl.ContainerOp(
        name=name,
        #image='<train-image>',
        image='10.18.127.1:5000/train-image:latest',
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
            '--model_name', model_name,
            '--model_version', model_version,
            '--epochs', epochs
        ],
        file_outputs={'output': '/output.txt'}
    )


@dsl.pipeline(
    name='resnet_cifar10_pipeline',
    description='Demonstrate an end-to-end training & serving pipeline using ResNet and CIFAR-10'
)
def resnet_pipeline():
    # error: the container could not find this file -> volume error -> hostPath
    raw_data_dir='/mnt/workspace/raw_data'   
    processed_data_dir='/mnt/workspace/processed_data'
    model_dir='/mnt/workspace/saved_model'
    epochs=50
    #trtserver_name='trtis'
    model_name='resnet_graphdef'
    model_version=1
   # webapp_prefix='webapp'
    #webapp_port=80
    
    persistent_volume_name = 'nvidia-workspace3'
    persistent_volume_path = '/mnt/workspace'
    
    #process = PreprocessOp('preprocess', raw_data_dir, processed_data_dir)
    preprocess = (
        PreprocessOp('preprocess', raw_data_dir, processed_data_dir)
        .add_volume(k8s_client.V1Volume(
            name=persistent_volume_name,
            nfs=k8s_client.V1NFSVolumeSource(path='/mnt/xfs/wgs-data/workspace', server='10.18.129.161')))
        .add_volume_mount(k8s_client.V1VolumeMount(mount_path=persistent_volume_path, name=persistent_volume_name))
        .add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3')
        .add_resource_limit("nvidia.com/gpu", "1")
        .add_resource_request("nvidia.com/gpu", "1"))
    
    train = (
        TrainOp('train', preprocess.outputs['output'], model_dir, model_name, model_version, epochs)
        .add_volume(k8s_client.V1Volume(
            name=persistent_volume_name,
            nfs=k8s_client.V1NFSVolumeSource(path='/mnt/xfs/wgs-data/workspace', server='10.18.129.161')))
        .add_volume_mount(k8s_client.V1VolumeMount(mount_path=persistent_volume_path, name=persistent_volume_name))
        .add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3')
        .add_resource_limit("nvidia.com/gpu", "1")
        .add_resource_request("nvidia.com/gpu", "1"))
import kfp.compiler as compiler
compiler.Compiler().compile(resnet_pipeline, 'trainV0529.tar.gz')
run = client.run_pipeline(exp.id, 'trainV0529', 'trainV0529.tar.gz')

Result:

 

6. Using a PV as storage for a pod

  The pv / pvc YAML provided by the official example, under pipeline/src

build.sh

Reproduce it this way, so that the PVC binds to its corresponding PV.

Reference: persistent-volume-storage

6.1 Create an index.html file on the node

sudo mkdir /mnt/data

sudo sh -c "echo 'Hello from Kubernetes storage' > /mnt/data/index.html"

6.2 Create a PV

Kubernetes supports hostPath on a single-node cluster for development and testing. A hostPath PV uses a file or directory on that node to emulate network-attached storage.

In a production cluster you would not use hostPath; instead you would use something like a Google Compute Engine persistent disk, an NFS share, or an Amazon Elastic Block Store volume. Administrators can also use a StorageClass to set up dynamic provisioning.
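As an aside, a minimal sketch of such a StorageClass (the provisioner name below is hypothetical and depends on what is deployed in the cluster):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-dynamic
provisioner: example.com/nfs    # hypothetical external provisioner
reclaimPolicy: Delete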

pods/storage/pv-volume.yaml

kind: PersistentVolume
apiVersion: v1
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"

It defines the storageClassName manual for the PV, which the PVC will use to bind to this PV.

Create the PV:

kubectl apply -f https://k8s.io/examples/pods/storage/pv-volume.yaml

(Here, using the local copy of the file:) kubectl apply -f pv-volume.yaml

View the PV:

kubectl get pv task-pv-volume

6.3 Create a PVC

Pods use PersistentVolumeClaims to request physical storage. The example below creates a PVC that requests a volume of at least 3Gi.

pods/storage/pv-claim.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

View the PVC:

kubectl get pvc task-pv-claim
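If the PV created above is still available, the claim binds to it and the output should show STATUS Bound.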

6.4 Create a pod

Create a pod that uses the PVC as a volume.

pods/storage/pv-pod.yaml

kind: Pod
apiVersion: v1
metadata:
  name: task-pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
       claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage

Note: the pod spec references a PVC, not the PV directly. From the pod's point of view, the PVC is simply a volume.
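To verify the mount end to end, the referenced tutorial exec's into the pod and fetches the page served from the PV (a sketch; curl may need to be installed inside the nginx container first):

kubectl exec -it task-pv-pod -- /bin/bash
# inside the container:
apt update && apt install -y curl
curl http://localhost/        # should print: Hello from Kubernetes storage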

 
