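Introduction
Kubeflow is a collection of tools that runs on top of Kubernetes (k8s) for developing, training, tuning, deploying, and managing machine learning workloads. It integrates open-source projects from many areas of machine learning, such as Jupyter, tfserving, Katib, Fairing, and Argo, and it can manage the different stages of a machine learning workflow: data preprocessing, model training, model prediction, service management, and so on.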
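I. Preparing the base environment
Kubernetes version: v1.20.5
Docker version: v19.03.15
kfctl version: v1.2.0-0-gbc038f9
kustomize version: v4.1.3
I am not sure whether Kubeflow 1.2.0 is fully compatible with Kubernetes 1.20.5; this deployment is only a test.
For version compatibility, see: https://www.kubeflow.org/docs/distributions/kfctl/overview#minimum-system-requirements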
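Because compatibility depends on the exact versions involved, it can help to compare version strings in a script before deploying. The following is a minimal sketch, not part of the original procedure: `version_ge` is a hypothetical helper name, and the bound you compare against has to come from the compatibility page above.

```shell
#!/bin/sh
# version_ge A B: exit 0 when version A >= version B, exit 1 otherwise.
# Versions are dotted numbers with an optional leading "v", e.g. v1.20.5.
# (Hypothetical helper for illustration; it is not part of kfctl or kubectl.)
version_ge() {
    awk -v a="${1#v}" -v b="${2#v}" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing fields count as 0; "+ 0" forces a numeric comparison.
            p = (i <= na) ? x[i] + 0 : 0
            q = (i <= nb) ? y[i] + 0 : 0
            if (p > q) exit 0
            if (p < q) exit 1
        }
        exit 0
    }'
}
```

For example, `version_ge v1.20.5 "$MIN_K8S" || echo "Kubernetes is older than required"`, where `MIN_K8S` is a placeholder for the minimum version listed in the documentation.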
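1. Install kfctl
kfctl is the control plane for deploying and managing Kubeflow. The main deployment mode is to use kfctl as a CLI, together with a KFDef configuration for the Kubernetes flavor in use, to deploy and manage Kubeflow.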
wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
chmod 755 kfctl
cp kfctl /usr/bin
kfctl version
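2. Install kustomize
Kustomize is a configuration management tool that uses layering to preserve the base settings of applications and components: declarative YAML artifacts, called patches, selectively override the defaults without changing the original files.
Download: https://github.com/kubernetes-sigs/kustomize/releases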
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.1.3/kustomize_v4.1.3_linux_amd64.tar.gz
tar -xzvf kustomize_v4.1.3_linux_amd64.tar.gz
chmod 755 kustomize
mv kustomize /usr/bin/
kustomize version
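Both tools are installed the same way: unpack a tarball, make the binary executable, and copy it onto the PATH. The steps can be wrapped in one function, sketched below under the assumption that the binary sits at the top level of the archive, as the commands above imply. `install_from_tarball` is a hypothetical helper name, not something shipped with either tool.

```shell
#!/bin/sh
# install_from_tarball TARBALL NAME DESTDIR
# Extract TARBALL into a scratch directory, check that it contains a file
# called NAME at its top level, and install that file into DESTDIR with
# mode 755. (Hypothetical helper for illustration.)
install_from_tarball() {
    tarball=$1
    name=$2
    destdir=$3
    workdir=$(mktemp -d) || return 1
    if ! tar -xzf "$tarball" -C "$workdir"; then
        rm -rf "$workdir"
        return 1
    fi
    if [ ! -f "$workdir/$name" ]; then
        echo "error: $name not found in $tarball" >&2
        rm -rf "$workdir"
        return 1
    fi
    mkdir -p "$destdir" &&
        cp "$workdir/$name" "$destdir/$name" &&
        chmod 755 "$destdir/$name"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

With the downloads above this would be called as `install_from_tarball kfctl_v1.2.0-0-gbc038f9_linux.tar.gz kfctl /usr/bin` and `install_from_tarball kustomize_v4.1.3_linux_amd64.tar.gz kustomize /usr/bin`. Working in a scratch directory keeps a wrong path from leaving the binary in the current directory.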
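III. Deployment over the public internet
If your servers can reach the public internet, you can run the installation directly.
This test deployment uses Alibaba Cloud machines in the US West 1 (Silicon Valley) region.
1. Create the Kubeflow working directory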
mkdir -p /apps/kubeflow
cd /apps/kubeflow
2. Configure the StorageClass
# cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-nas
mountOptions:
  - nolock,tcp,noresvport
  - vers=3
parameters:
  volumeAs: subpath
  server: "*********.us-west-1.nas.aliyuncs.com:/nasroot1/"  # Alibaba Cloud NAS storage is used here
  archiveOnDelete: "false"
provisioner: nasplugin.csi.alibabacloud.com
reclaimPolicy: Retain
Create the StorageClass in the cluster from this file:
# kubectl apply -f storageclass.yaml
3. Set it as the default StorageClass
# kubectl get sc
NAME           PROVISIONER                      RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
alicloud-nas   nasplugin.csi.alibabacloud.com   Retain          Immediate           false                  24h
# Set the annotation to "false" instead to remove the default flag
# kubectl patch storageclass alicloud-nas -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# kubectl get sc
NAME                     PROVISIONER                      RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
alicloud-nas (default)   nasplugin.csi.alibabacloud.com   Retain          Immediate           false                  24h
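The `(default)` marker in the `kubectl get sc` output can also be checked from a script rather than by eye. A minimal sketch, assuming the default table output shown above; `default_storageclass` is a hypothetical helper name.

```shell
#!/bin/sh
# default_storageclass: read `kubectl get sc` output on stdin and print the
# name of every StorageClass that kubectl marks with "(default)".
# (Hypothetical helper for illustration; assumes the default table format.)
default_storageclass() {
    # Skip the header line; the marker appears as the second field.
    awk 'NR > 1 && $2 == "(default)" { print $1 }'
}
```

Usage: `kubectl get sc | default_storageclass`. No output means no default is set, and more than one line means several classes carry the default annotation.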
4. Install and deploy
wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
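The pods take some time to come up after `kfctl apply` returns. Instead of reading each listing by eye, a small filter can print only the pods that still need attention. This is a sketch that assumes the single-namespace `kubectl get pods -n <namespace>` table format; `pods_not_ready` is a hypothetical helper name.

```shell
#!/bin/sh
# pods_not_ready: read `kubectl get pods -n <namespace>` output on stdin and
# print "NAME STATUS READY" for every pod that is neither Completed nor
# Running with all of its containers ready.
# (Hypothetical helper for illustration; assumes the default table format.)
pods_not_ready() {
    awk 'NR > 1 {
        if ($3 == "Completed") next      # finished one-off jobs are fine
        split($2, r, "/")                # READY column, e.g. "2/2"
        if ($3 != "Running" || r[1] != r[2]) print $1, $3, $2
    }'
}
```

It could be run across the namespaces created by this deployment with `for ns in cert-manager istio-system knative-serving kubeflow; do kubectl get pods -n "$ns" | pods_not_ready; done`. Empty output means every pod is either Running and fully ready or Completed.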
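Once all the pods have been created, check them namespace by namespace.
Make sure every pod listed below is in the Running state (one-off job pods show Completed instead).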
# kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7c75b559c4-c2hhj              1/1     Running   0          23h
cert-manager-cainjector-7f964fd7b5-mxbjl   1/1     Running   0          23h
cert-manager-webhook-566dd99d6-6vvzv       1/1     Running   2          23h
# kubectl get pods -n istio-system
NAME                                                         READY   STATUS      RESTARTS   AGE
cluster-local-gateway-5898bc5c74-822c9                       1/1     Running     0          23h
cluster-local-gateway-5898bc5c74-b5tmr                       1/1     Running     0          23h
cluster-local-gateway-5898bc5c74-fpswf                       1/1     Running     0          23h
istio-citadel-6dffd79d7-4scx7                                1/1     Running     0          23h
istio-galley-77cb9b44dc-6l4lm                                1/1     Running     0          23h
istio-ingressgateway-7bb77f89b8-psqcm                        1/1     Running     0          23h
istio-nodeagent-5qsmg                                        1/1     Running     0          23h
istio-nodeagent-ccc8j                                        1/1     Running     0          23h
istio-nodeagent-gqrsl                                        1/1     Running     0          23h
istio-pilot-67d94fc954-vl2sx                                 2/2     Running     0          23h
istio-policy-546596d4b4-6ct59                                2/2     Running     1          23h
istio-security-post-install-release-1.3-latest-daily-qbrf6   0/1     Completed   0          23h
istio-sidecar-injector-796b6454d9-lv8dg                      1/1     Running     0          23h
istio-telemetry-58f9cd4bf5-8cjj5                             2/2     Running     1          23h
prometheus-7c6d764c48-s29kn                                  1/1     Running     0          23h
# kubectl get pods -n knative-serving
NAME                                READY   STATUS    RESTARTS   AGE
activator-6c87fcbbb6-f4cs2          1/1     Running   0          23h
autoscaler-847b9f89dc-5jvml         1/1     Running   0          23h
controller-55f67c9ddb-67vvc         1/1     Running   0          23h
istio-webhook-db664df87-jn72n       1/1     Running   0          23h
networking-istio-76f8cc7796-9jr2j   1/1     Running   0          23h
webhook-6bff77594b-2r2gx            1/1     Running   0          23h
# kubectl get pods -n kubeflow
NAME                                                     READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0               1/1     Running   4          23h
admission-webhook-deployment-5cd7dc96f5-fw7d4            1/1     Running   2          23h
application-controller-stateful-set-0                    1/1     Running   0          23h
argo-ui-65df8c7c84-qwtc8                                 1/1     Running   0          23h
cache-deployer-deployment-5f4979f45-2xqbf                2/2     Running   2          23h
cache-server-7859fd67f5-hplhm                            2/2     Running   0          23h
centraldashboard-67767584dc-j9ffz                        1/1     Running   0          23h
jupyter-web-app-deployment-8486d5ffff-hmbz4              1/1     Running   0          23h
katib-controller-7fcc95676b-rn98v                        1/1     Running   1          23h
katib-db-manager-85db457c64-jx97j                        1/1     Running   0          23h
katib-mysql-6c7f7fb869-bt87c                             1/1     Running   0          23h
katib-ui-65dc4cf6f5-nhmsg                                1/1     Running   0          23h
kfserving-controller-manager-0                           2/2     Running   0          23h
kubeflow-pipelines-profile-controller-797fb44db9-rqzmg   1/1     Running   0          23h
metacontroller-0                                         1/1     Running   0          23h
metadata-db-6dd978c5b-zzntn                              1/1     Running   0          23h
metadata-envoy-deployment-67bd5954c-zvpf4                1/1     Running   0          23h
meta