ResNet-CIFAR-10-GPU
1. Build the images
Reference: components
1.1 Webapp
IMAGE=<webapp-image>
git clone https://github.com/NVIDIA/tensorrt-inference-server.git
base=tensorrt-inference-server
docker build -t base-trtis-client -f $base/Dockerfile.client $base
rm -rf $base
# Build & push webapp image
docker build -t $IMAGE .
docker push $IMAGE
Result:
1.2 Webapp_launcher
IMAGE=<inference-server-launcher-image>
docker build -t $IMAGE .
docker push $IMAGE
2. Step 4/9: dataset download failure
- preprocess.py fails to download the dataset
- Message shown: no shared file
- Cause: the source code has no interface for reading the dataset from the storage-volume path; it always downloads the dataset itself.
- Fix: copy the dataset into the image at /root/.keras/dataset (see the sketch below).
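Why this works (a minimal sketch, assuming the standard Keras cache layout under ~/.keras, i.e. /root/.keras when running as root):

# cifar10.load_data() fetches the archive via keras.utils.get_file(),
# which checks the local Keras cache before downloading anything.
from keras.datasets import cifar10

# With cifar-10-python.tar.gz pre-copied into the image's Keras cache,
# this call resolves against the cached archive and needs no network.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, x_test.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)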
Test:
root@40915777bc9a:/mnt/workspace# python preprocess.py --input_dir=/mnt/workspace/raw_data/ --output_dir=/mnt/workspace/processed_data/
Using TensorFlow backend.
input_dir: /mnt/workspace/raw_data/
output_dir: /mnt/workspace/processed_data/
root@40915777bc9a:/mnt/workspace# ls
cifar-10-python.tar.gz preprocess.py process-test.py processed_data raw_data saved_model
root@40915777bc9a:/mnt/workspace# cd processed_data/
root@40915777bc9a:/mnt/workspace/processed_data# ls
x_test.npy x_train.npy y_test.npy y_train.npy
Result:
[root@comput3 preprocess]# docker commit 40915777bc9a 10.18.127.1:5000/preprocess:v0529
The new image is 10.18.127.1:5000/preprocess:v0529.
3. Add a node label and GPU resources
preprocess = PreprocessOp('preprocess', raw_data_dir, processed_data_dir) \
    .add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3') \
    .add_resource_limit("nvidia.com/gpu", "1") \
    .add_resource_request("nvidia.com/gpu", "1")
4. GPU scheduling error
Error:
Warning FailedScheduling 40s (x2 over 40s) default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) were unschedulable
kubectl describe nodes
No GPU resources were found on the node. Troubleshooting localized the problem to Docker, so nvidia-docker2 was reinstalled.
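Before reinstalling, one quick way to confirm what the scheduler sees (a sketch using the kubernetes Python client, which the pipeline code below already uses; assumes a working kubeconfig):

# Print each node's allocatable nvidia.com/gpu count; a missing key means
# the device plugin has not advertised any GPUs on that node.
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get('nvidia.com/gpu', 'none')
    print(node.metadata.name, gpus)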
4.1 Install nvidia-docker2
Uninstalled nvidia-docker 1.0 and installed version 2.0.
Make sure the NVIDIA driver and a supported Docker version are already installed.
CentOS distributions
Install the repository for your distribution: https://nvidia.github.io/nvidia-docker/
[root@comput3 ~]# yum install nvidia-docker2
Error:
The current docker 18.09.2 is incompatible with nvidia-docker2-2.0.3-3.docker18.09.6.ce.noarch.
Summary: reinstall docker 18.09.6.
[root@comput3 ~]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
[root@comput3 ~]# yum list docker-ce --showduplicates | sort -r
4.2 Reinstall Docker
[root@comput3 ~]# yum install docker-ce-18.09.6 docker-ce-cli-18.09.6 containerd.io
[root@comput3 ~]# systemctl enable docker.service
Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /etc/systemd/system/docker.service.
[root@comput3 ~]# systemctl start docker
Check Docker status:
[root@comput3 ~]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete
Digest: sha256:0e11c388b664df8a27a901dce21eb89f11d8292f7fca1b3e3c4321bf7897bffe
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
[root@comput3 ~]# yum install -y nvidia-docker2
[root@comput3 docker]# pkill -SIGHUP dockerd
4.3 Configure nvidia-docker
Check which Docker configuration file the system actually loads:
[root@comput3 system]# systemctl status docker.service
--add-runtime nvidia=/usr/bin/nvidia-container-runtime --default-runtime nvidia --insecure-registry=10.18.127.1:5000
Configuration file: /etc/systemd/system/docker.service; apply the changes to docker.service there.
Note: the default nvidia-container-runtime configuration file /etc/docker/daemon.json is never loaded by the system,
so move it aside: mv /etc/docker/daemon.json /etc/docker/daemon-src.json
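To confirm the daemon actually picked up the nvidia runtime after the edit, you can query it directly (a sketch using only the Python standard library; docker must be on PATH):

# `docker info` can emit its state as JSON; DefaultRuntime and Runtimes
# reflect the flags configured in docker.service above.
import json, subprocess

info = json.loads(subprocess.check_output(
    ['docker', 'info', '--format', '{{json .}}']))
print(info.get('DefaultRuntime'))        # expect: nvidia
print(sorted(info.get('Runtimes', {})))  # nvidia should be listed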
After changing Docker, all pods on the node went down; restart the machines in the cluster.
On the node: systemctl start docker
On the master: kubectl get pod --all-namespaces
All pods show status Running.
4.4 Check GPU status
[root@comput1 nxt]# kubectl describe nodes
10.18.127.3 gpu-test
[root@comput3 ~]# nvidia-docker run -it --name tf-gpu-test nvcr.io/nvidia/tensorflow:19.03-py3
root@60440df36d9d:/workspace# python
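Inside the Python session, a minimal visibility check (assuming the TensorFlow 1.x API shipped in the 19.03 NGC image):

import tensorflow as tf

# Both calls probe CUDA device availability from inside the container.
print(tf.test.is_gpu_available())   # True when the GPU is passed through
print(tf.test.gpu_device_name())    # e.g. '/device:GPU:0'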
4.5 Pods in CrashLoopBackOff hold GPUs
The available GPU count showed as 0.
Delete all pods stuck in Pending or CrashLoopBackOff (scripted below).
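A scripted version of that cleanup (a sketch with the kubernetes Python client; status and field names follow the core v1 API):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    # Collect waiting reasons such as 'CrashLoopBackOff' from each container.
    waiting = [cs.state.waiting.reason
               for cs in (pod.status.container_statuses or [])
               if cs.state and cs.state.waiting]
    if pod.status.phase == 'Pending' or 'CrashLoopBackOff' in waiting:
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)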
Test:
[root@comput1 nxt]# vim gpu-1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-11
spec:
  restartPolicy: Never
  containers:
    - image: nvidia/cuda:9.0-devel # the image tag must be specified here
      name: cuda
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
[root@comput1 nxt]# kubectl apply -f gpu-1.yaml
[root@comput1 nxt]# kubectl get pod
NAME READY STATUS RESTARTS AGE
gpu-pod-11 0/1 Completed 0 2m40s
Summary:
At this point the GPU driver, Docker, nvidia-docker, and nvidia-container-runtime are all installed,
and GPU resources can be scheduled through Kubernetes.
5. resnet-case
Environment: Jupyter Notebook.
Goal: an image classification problem.
Steps: dataset preprocessing, model training, validating model accuracy, and a webapp for model serving (the webapp part is unfinished; see Section 1).
Code:
import kfp
import kfp.dsl as dsl
import datetime
import os
import kfp.notebook
import kfp.gcp as gcp
from kubernetes import client as k8s_client

client = kfp.Client()
EXPERIMENT_NAME = 'resnet-train-imagesV0529'
exp = client.create_experiment(name=EXPERIMENT_NAME)

# Modify image='<image>' in each op to match IMAGE in the build.sh of its corresponding component
def PreprocessOp(name, input_dir, output_dir):
    return dsl.ContainerOp(
        name=name,
        # image='<preprocess-image>',
        image='10.18.127.1:5000/preprocess:v0529',
        command=['python', 'preprocess.py'],
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
        ],
        file_outputs={'output': '/output.txt'}
    )

def TrainOp(name, input_dir, output_dir, model_name, model_version, epochs):
    return dsl.ContainerOp(
        name=name,
        # image='<train-image>',
        image='10.18.127.1:5000/train-image:latest',
        arguments=[
            '--input_dir', input_dir,
            '--output_dir', output_dir,
            '--model_name', model_name,
            '--model_version', model_version,
            '--epochs', epochs
        ],
        file_outputs={'output': '/output.txt'}
    )

@dsl.pipeline(
    name='resnet_cifar10_pipeline',
    description='Demonstrate an end-to-end training & serving pipeline using ResNet and CIFAR-10'
)
def resnet_pipeline():
    # error: 'no such file' inside the container -> volume error -> hostPath
    raw_data_dir = '/mnt/workspace/raw_data'
    processed_data_dir = '/mnt/workspace/processed_data'
    model_dir = '/mnt/workspace/saved_model'
    epochs = 50
    # trtserver_name = 'trtis'
    model_name = 'resnet_graphdef'
    model_version = 1
    # webapp_prefix = 'webapp'
    # webapp_port = 80
    persistent_volume_name = 'nvidia-workspace3'
    persistent_volume_path = '/mnt/workspace'

    # preprocess = PreprocessOp('preprocess', raw_data_dir, processed_data_dir)
    preprocess = PreprocessOp('preprocess', raw_data_dir, processed_data_dir).add_volume(
        k8s_client.V1Volume(
            name=persistent_volume_name,
            nfs=k8s_client.V1NFSVolumeSource(path='/mnt/xfs/wgs-data/workspace', server='10.18.129.161'))
    ).add_volume_mount(
        k8s_client.V1VolumeMount(mount_path=persistent_volume_path, name=persistent_volume_name)
    ).add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3'
    ).add_resource_limit("nvidia.com/gpu", "1").add_resource_request("nvidia.com/gpu", "1")

    train = TrainOp('train', preprocess.outputs['output'], model_dir, model_name, model_version, epochs).add_volume(
        k8s_client.V1Volume(
            name=persistent_volume_name,
            nfs=k8s_client.V1NFSVolumeSource(path='/mnt/xfs/wgs-data/workspace', server='10.18.129.161'))
    ).add_volume_mount(
        k8s_client.V1VolumeMount(mount_path=persistent_volume_path, name=persistent_volume_name)
    ).add_node_selector_constraint('kubernetes.io/hostname', '10.18.127.3'
    ).add_resource_limit("nvidia.com/gpu", "1").add_resource_request("nvidia.com/gpu", "1")

import kfp.compiler as compiler
compiler.Compiler().compile(resnet_pipeline, 'trainV0529.tar.gz')
run = client.run_pipeline(exp.id, 'trainV0529', 'trainV0529.tar.gz')
Result:
6. Using a PV as storage for a Pod
The official docs provide pv/pvc.yaml examples (pipeline/src);
reproduce the setup that way and let the PVC bind to its matching PV.
6.1 Create an index.html file on the node
sudo mkdir /mnt/data
sudo sh -c "echo 'Hello from Kubernetes storage' > /mnt/data/index.html"
6.2 Create a PV
Kubernetes supports hostPath on a single-node cluster for development and testing. A hostPath PV uses a file or directory on the node to emulate network-attached storage.
In a production cluster you would not use hostPath; use something like a Google Compute Engine persistent disk, an NFS share, or an Amazon Elastic Block Store volume instead. Administrators can also use a StorageClass to set up dynamic provisioning.
pods/storage/pv-volume.yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"
It defines storageClassName: manual for the PV, which is used to match and bind PVC requests.
Create the PV:
kubectl apply -f https://k8s.io/examples/pods/storage/pv-volume.yaml
(replace with the local file here) kubectl apply -f pv-volume.yaml
Check the PV:
kubectl get pv task-pv-volume
6.3 Create a PVC
Pods use a PVC to request physical storage. The example below creates a PVC that requests a volume of at least 3Gi.
pods/storage/pv-claim.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
Check the PVC:
kubectl get pvc task-pv-claim
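The same check from Python (a sketch with the kubernetes client; assumes the claim lives in the default namespace):

from kubernetes import client, config

config.load_kube_config()
pvc = client.CoreV1Api().read_namespaced_persistent_volume_claim(
    'task-pv-claim', 'default')
# Once bound, the claim records the name of the PV it matched.
print(pvc.status.phase, pvc.spec.volume_name)  # expect: Bound task-pv-volume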
6.4 Create a Pod
Create a pod that uses the PVC as its volume.
pods/storage/pv-pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: task-pv-pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
Note: the Pod spec references a PVC, not a PV directly. From the Pod's point of view, the PVC is simply a volume.