ResNet-CIFAR-10-GPU
1. Build the images
Reference: components
1.1 Webapp
IMAGE=<webapp-image>
git clone https://github.com/NVIDIA/tensorrt-inference-server.git
base=tensorrt-inference-server
docker build -t base-trtis-client -f $base/Dockerfile.client $base
rm -rf $base
# Build & push webapp image
docker build -t $IMAGE .
docker push $IMAGE
Result:
1.2 Webapp_launcher
IMAGE=<inference-server-launcher-image>
docker build -t $IMAGE .
docker push $IMAGE
2. Step 4/9: dataset download failure
- preprocess.py fails to download the dataset
- Message shown: no shared file
- Cause: the source code has no interface for reading the dataset from the storage-volume path; it always downloads the dataset itself.
- Fix: copy the dataset into the image at /root/.keras/dataset (see the sketch below).
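Why this works (a minimal sketch, assuming the standard Keras cache layout under ~/.keras, i.e. /root/.keras when running as root):

# cifar10.load_data() fetches the archive via keras.utils.get_file(),
# which checks the local Keras cache before downloading anything.
from keras.datasets import cifar10

# With cifar-10-python.tar.gz pre-copied into the image's Keras cache,
# this call resolves against the cached archive and needs no network.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, x_test.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)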
Test:
root@40915777bc9a:/mnt/workspace# python preprocess.py --input_dir=/mnt/workspace/raw_data/ --output_dir=/mnt/workspace/processed_data/
Using TensorFlow backend.
input_dir: /mnt/workspace/raw_data/
output_dir: /mnt/workspace/processed_data/
root@40915777bc9a:/mnt/workspace# ls
cifar-10-python.tar.gz preprocess.py process-test.py processed_data raw_data saved_model
root@40915777bc9a:/mnt/workspace# cd processed_data/
root@40915777bc9a:/mnt/workspace/processed_data# ls
x_test.npy x_train.npy y_test.npy y_train.npy
Result:
[root@comput3 preprocess]# docker commit 40915777bc9a 10.18.127.1:5000/preprocess:v0529
The new image is 10.18.127.1:5000/preprocess:v0529.
3. Add a node label and GPU resources
preprocess = PreprocessOp('preprocess', raw_data_dir, processed_data_dir) \
    .add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3') \
    .add_resource_limit("nvidia.com/gpu", "1") \
    .add_resource_request("nvidia.com/gpu", "1")
4. GPU scheduling error
Error:
Warning FailedScheduling 40s (x2 over 40s) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable
kubectl describe nodes
No GPU resources were found on the node. Troubleshooting localized the problem to Docker, so nvidia-docker2 was reinstalled.
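Before reinstalling, one quick way to confirm what the scheduler sees (a sketch using the kubernetes Python client, which the pipeline code below already uses; assumes a working kubeconfig):

# Print each node's allocatable nvidia.com/gpu count; a missing key means
# the device plugin has not advertised any GPUs on that node.
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get('nvidia.com/gpu', 'none')
    print(node.metadata.name, gpus)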
4.1 Install nvidia-docker2
Uninstalled nvidia-docker 1.0 and installed version 2.0.
Make sure the NVIDIA driver and a supported Docker version are already installed.
CentOS distributions
Install the repository for your distribution: https://nvidia.github.io/nvidia-docker/
[root@comput3 ~]# yum install nvidia-docker2
Error:
The current docker 18.09.2 is incompatible with nvidia-docker2-2.0.3-3.docker18.09.6.ce.noarch.
Summary: reinstall docker 18.09.6.
[root@comput3 ~]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
[root@comput3 ~]# yum list docker-ce --showduplicates | sort -r
4.2 Reinstall Docker
[root@comput3 ~]# yum install docker-ce-18.09.6 docker-ce-cli-18.09.6 containerd.io
[root@comput3 ~]# systemctl enable docker.service
Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /etc/systemd/system/docker.service.
[root@comput3 ~]# systemctl start docker
Check Docker status:
[root@comput3 ~]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete
Digest: sha256:0e11c388b664df8a27a901dce21eb89f11d8292f7fca1b3e3c4321bf7897bffe
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
[root@comput3 ~]# yum install -y nvidia-docker2
[root@comput3 docker]# pkill -SIGHUP dockerd
4.3 Configure nvidia-docker
Check which Docker configuration file the system actually loads:
[root@comput3 system]# systemctl status docker.service
--add-runtime nvidia=/usr/bin/nvidia-container-runtime --default-runtime nvidia --insecure-registry=10.18.127.1:5000
Configuration file: /etc/systemd/system/docker.service; apply the changes to docker.service there.
Note: the default nvidia-container-runtime configuration file /etc/docker/daemon.json is never loaded by the system,
so move it aside: mv /etc/docker/daemon.json /etc/docker/daemon-src.json
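To confirm the daemon actually picked up the nvidia runtime after the edit, you can query it directly (a sketch using only the Python standard library; docker must be on PATH):

# `docker info` can emit its state as JSON; DefaultRuntime and Runtimes
# reflect the flags configured in docker.service above.
import json, subprocess

info = json.loads(subprocess.check_output(
    ['docker', 'info', '--format', '{{json .}}']))
print(info.get('DefaultRuntime'))        # expect: nvidia
print(sorted(info.get('Runtimes', {})))  # nvidia should be listed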
After changing Docker, all pods on the node went down; restart the machines in the cluster.
On the node: systemctl start docker
On the master: kubectl get pod --all-namespaces
All pods show status Running.
4.4 Check GPU status
[root@comput1 nxt]# kubectl describe nodes
10.18.127.3 gpu-test
[root@comput3 ~]# nvidia-docker run -it --name tf-gpu-test nvcr.io/nvidia/tensorflow:19.03-py3
root@60440df36d9d:/workspace# python
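Inside the Python session, a minimal visibility check (assuming the TensorFlow 1.x API shipped in the 19.03 NGC image):

import tensorflow as tf

# Both calls probe CUDA device availability from inside the container.
print(tf.test.is_gpu_available())   # True when the GPU is passed through
print(tf.test.gpu_device_name())    # e.g. '/device:GPU:0'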
4.5 Pods in CrashLoopBackOff hold GPUs
The available GPU count showed as 0.
Delete all pods stuck in Pending or CrashLoopBackOff (scripted below).
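A scripted version of that cleanup (a sketch with the kubernetes Python client; status and field names follow the core v1 API):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    # Collect waiting reasons such as 'CrashLoopBackOff' from each container.
    waiting = [cs.state.waiting.reason
               for cs in (pod.status.container_statuses or [])
               if cs.state and cs.state.waiting]
    if pod.status.phase == 'Pending' or 'CrashLoopBackOff' in waiting:
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)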
Test:
[root@comput1 nxt]# vim gpu-1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-11
spec:
  restartPolicy: Never
  containers:
    - image: nvidia/cuda:9.0-devel # the image tag must be specified here
      name: cuda
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
[root@comput1 nxt]# kubectl apply -f gpu-1.yaml
[root@comput1 nxt]# kubectl get pod
NAME READY STATUS RESTARTS AGE
gpu-pod-11 0/1 Completed 0 2m40s
Summary:
At this point the GPU driver, Docker, nvidia-docker, and nvidia-container-runtime are all installed,
and GPU resources can be scheduled through Kubernetes.
5. resnet-case
Environment: Jupyter Notebook.
Goal: an image classification problem.
Steps: dataset preprocessing, model training, validating model accuracy, and a webapp for model serving (the webapp part is unfinished; see Section 1).
Code:
import kfp
import kfp.dsl as dsl
import datetime
import os
import kfp.notebook
import kfp.gcp as gcp
from kubernetes import client as k8s_client

client = kfp.Client()
EXPERIMENT_NAME = 'resnet-train-imagesV0529'
exp = client.create_experiment(name=EXPERIMENT_NAME)

# Modify image='<image>' in each op to match IMAGE in the build.sh of its corresponding component
def PreprocessOp(name, input_dir, output_dir):
    return dsl.ContainerOp(
        name=name,
        # image='<preprocess-image>',
        image='10.18.127.1:5000/preprocess:v0529',
        command=['python', 'preprocess.py'],
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
        ],
        file_outputs={'output': '/output.txt'}
    )

def TrainOp(name, input_dir, output_dir, model_name, model_version, epochs):
    return dsl.ContainerOp(
        name=name,
        # image='<train-image>',
        image='10.18.127.1:5000/train-image:latest',
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
            '--model_name', model_name,
            '--model_version', model_version,
            '--epochs', epochs
        ],
        file_outputs={'output': '/output.txt'}
    )

@dsl.pipeline(
    name='resnet_cifar10_pipeline',
    description='Demonstrate an end-to-end training & serving pipeline using ResNet and CIFAR-10'
)
def resnet_pipeline():
    # error: 'no such file' inside the container -> volume error -> hostPath
    raw_data_dir = '/mnt/workspace/raw_data'
    processed_data_dir = '/mnt/workspace/processed_data'
    model_dir = '/mnt/workspace/saved_model'
    epochs = 50
    # trtserver_name = 'trtis'
    model_name = 'resnet_graphdef'
    model_version = 1
    # webapp_prefix = 'webapp'
    # webapp_port = 80
    persistent_volume_name = 'nvidia-workspace3'
    persistent_volume_path = '/mnt/workspace'

    # preprocess = PreprocessOp('preprocess', raw_data_dir, processed_data_dir)
    preprocess = PreprocessOp('preprocess', raw_data_dir, processed_data_dir).add_volume(
        k8s_client.V1Volume(
            name=persistent_volume_name,
            nfs=k8s_client.V1NFSVolumeSource(path='/mnt/xfs/wgs-data/workspace', server='10.18.129.161'))
    ).add_volume_mount(
        k8s_client.V1VolumeMount(mount_path=persistent_volume_path, name=persistent_volume_name)
    ).add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3'
    ).add_resource_limit("nvidia.com/gpu", "1").add_resource_request("nvidia.com/gpu", "1")

    train = TrainOp('train', preprocess.outputs['output'], model_dir, model_name, model_version, epochs).add_volume(
        k8s_client.V1Volume(
            name=persistent_volume_name,
            nfs=k8s_client.V1NFSVolumeSource(path='/mnt/xfs/wgs-data/workspace', server='10.18.129.161'))
    ).add_volume_mount(
        k8s_client.V1VolumeMount(mount_path=persistent_volume_path, name=persistent_volume_name)
    ).add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3'
    ).add_resource_limit("nvidia.com/gpu", "1").add_resource_request("nvidia.com/gpu", "1")

import kfp.compiler as compiler
compiler.Compiler().compile(resnet_pipeline, 'trainV0529.tar.gz')
run = client.run_pipeline(exp.id, 'trainV0529', 'trainV0529.tar.gz')
Result:
6. Using a PV as storage for a Pod
The official docs provide pv/pvc.yaml examples (pipeline/src);
reproduce the setup that way and let the PVC bind to its matching PV.
6.1 Create an index.html file on the node
sudo mkdir /mnt/data
sudo sh -c "echo 'Hello from Kubernetes storage' > /mnt/data/index.html"
6.2 Create a PV
Kubernetes supports hostPath on a single-node cluster for development and testing. A hostPath PV uses a file or directory on the node to emulate network-attached storage.
In a production cluster you would not use hostPath; use something like a Google Compute Engine persistent disk, an NFS share, or an Amazon Elastic Block Store volume instead. Administrators can also use a StorageClass to set up dynamic provisioning.
pods/storage/pv-volume.yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"
It defines storageClassName: manual for the PV, which is used to match and bind PVC requests.
Create the PV:
kubectl apply -f https://k8s.io/examples/pods/storage/pv-volume.yaml
(replace with the local file here) kubectl apply -f pv-volume.yaml
Check the PV:
kubectl get pv task-pv-volume
6.3 Create a PVC
Pods use a PVC to request physical storage. The example below creates a PVC that requests a volume of at least 3Gi.
pods/storage/pv-claim.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
Check the PVC:
kubectl get pvc task-pv-claim
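The same check from Python (a sketch with the kubernetes client; assumes the claim lives in the default namespace):

from kubernetes import client, config

config.load_kube_config()
pvc = client.CoreV1Api().read_namespaced_persistent_volume_claim(
    'task-pv-claim', 'default')
# Once bound, the claim records the name of the PV it matched.
print(pvc.status.phase, pvc.spec.volume_name)  # expect: Bound task-pv-volume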
6.4 Create a Pod
Create a pod that uses the PVC as its volume.
pods/storage/pv-pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: task-pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
Note: the Pod spec references a PVC, not a PV directly. From the Pod's point of view, the PVC is simply a volume.