0 Preface
To the reader:
This series began as my own study notes, so it may run a little long; it also records my line of study and research and will keep being updated as I learn. Readers are welcome to borrow from it; nothing would make me happier than for it to be of some help~
1 Install Kubeflow with Outside Network (FAILED)
1.1 Checking Node Hardware
Memory
Check total memory:
user@node01:~$ free -m # -m shows sizes in MB; -k / -g also work
              total        used        free      shared  buff/cache   available
Mem:         128601        4026      121630         106        2945      123273
Swap:             0           0           0
root@master:/home/hqc# free -m
              total        used        free      shared  buff/cache   available
Mem:          64089        3733       53620         202        6735       59561
Swap:             0           0           0
INFO: node01 has roughly 125 GB of RAM in total; master roughly 62 GB.
Check memory usage:
# install htop
user@node01:~$ sudo apt install htop
# run it
user@node01:~$ htop
CPU
CPU model:
user@node01:~$ grep "model name" /proc/cpuinfo |awk -F ':' '{print $NF}'
Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
INFO: the CPU is an Intel Core i9-10980XE.
Full CPU details:
user@node01:~$ cat /proc/cpuinfo
processor : 22
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
stepping : 7
microcode : 0x5003102
cpu MHz : 1200.002
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 4
cpu cores : 18 # number of physical cores: 18
apicid : 9
initial apicid : 9
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti retpoline mba rsb_ctxsw tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512_vnni
bugs : cpu_meltdown spectre_v1 spectre_v2
bogomips : 6000.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
Disk
user@node01:~$ sudo fdisk -l |grep "Disk /dev/sd"
Disk /dev/sda: 465.8 GiB, 500107862016 bytes, 976773168 sectors
INFO: the disk is 465.8 GiB.
user@node01:~$ df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 2.6M 13G 1% /run
/dev/nvme0n1p7 469G 43G 402G 10% /
tmpfs 63G 46M 63G 1% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/loop0 640K 640K 0 100% /snap/gnome-logs/103
/dev/loop1 56M 56M 0 100% /snap/core18/2284
/dev/loop2 62M 62M 0 100% /snap/core20/1405
/dev/loop3 249M 249M 0 100% /snap/gnome-3-38-2004/99
/dev/loop4 128K 128K 0 100% /snap/bare/5
/dev/loop6 66M 66M 0 100% /snap/gtk-common-themes/1515
/dev/loop5 45M 45M 0 100% /snap/snapd/15314
/dev/loop7 2.5M 2.5M 0 100% /snap/gnome-calculator/884
/dev/loop8 2.3M 2.3M 0 100% /snap/gnome-system-monitor/157
/dev/loop9 640K 640K 0 100% /snap/gnome-logs/106
/dev/loop10 44M 44M 0 100% /snap/snapd/15177
/dev/loop11 219M 219M 0 100% /snap/gnome-3-34-1804/77
/dev/loop12 219M 219M 0 100% /snap/gnome-3-34-1804/72
/dev/loop13 768K 768K 0 100% /snap/gnome-characters/741
/dev/loop14 768K 768K 0 100% /snap/gnome-characters/761
/dev/loop15 62M 62M 0 100% /snap/core20/1376
/dev/loop16 2.7M 2.7M 0 100% /snap/gnome-calculator/920
/dev/loop17 2.7M 2.7M 0 100% /snap/gnome-system-monitor/174
/dev/loop18 66M 66M 0 100% /snap/gtk-common-themes/1519
/dev/loop19 56M 56M 0 100% /snap/core18/2344
/dev/loop20 248M 248M 0 100% /snap/gnome-3-38-2004/87
/dev/nvme0n1p5 735M 117M 565M 18% /boot
/dev/nvme0n1p8 438G 33G 383G 8% /home
/dev/sda1 256M 32M 225M 13% /boot/efi
tmpfs 13G 16K 13G 1% /run/user/121
tmpfs 13G 80K 13G 1% /run/user/1000
root@master:/home/hqc# df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 2.9M 6.3G 1% /run
/dev/nvme0n1p6 29G 3.4G 24G 13% /
/dev/nvme0n1p10 94G 40G 49G 45% /usr
tmpfs 32G 50M 32G 1% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/nvme0n1p9 9.4G 123M 8.8G 2% /tmp
/dev/nvme0n1p7 946M 176M 706M 20% /boot
/dev/nvme0n1p11 9.4G 7.7G 1.2G 87% /var
/dev/nvme0n1p8 47G 20G 25G 45% /home
INFO: master has far less storage than node01; probably only master's /usr barely meets the requirement.
Check whether Ubuntu is 64-bit
user@node01:~$ getconf LONG_BIT
64
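All of the checks above can be rolled into one short survey script (a sketch; the mount points are just the ones used on these machines):
# quick hardware survey (sketch)
free -g | awk '/^Mem:/ {print "RAM (GiB): " $2}'   # total RAM
nproc                                              # logical CPUs
grep -m1 "model name" /proc/cpuinfo                # CPU model
df -h / /home                                      # storage on the main mounts
getconf LONG_BIT                                   # 64-bit check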
Summary
Kubeflow requires:
- At least one worker node in the cluster. ☑
- At least 4 CPU cores, 50 GB of storage, and 12 GB of memory per node; node01 has 18 cores, 465 GB of storage, and about 125 GB of memory, far above the minimum. ☑
- Kubernetes newer than 1.11; this cluster runs 1.18. ☑
1.2 Checking Version Compatibility
My cluster runs Kubernetes 1.18, against which no Kubeflow version has been fully tested; since upgrading Kubernetes is not an option, I decided on the then-latest Kubeflow 1.2, as others have reported installing it successfully (reference link).
1.3 Downloading the Required Files
Two things are needed: kfctl and kfctl_k8s_istio.v1.2.0.yaml.
Download kfctl
YAML file URL
Not knowing how to download the YAML file directly, I simply copy-pasted its contents.
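In hindsight, the raw file can be fetched straight from the manifests repo; kfctl's own help output (shown further below) lists exactly this URL, so a plain wget should work (a sketch, not what I actually did):
wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml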
1.4 Extracting the kfctl Archive
user@node01:~/Kubeflow$ tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
./kfctl
user@node01:~/Kubeflow$ ls
kfctl kfctl_k8s_istio.v1.2.0.yaml kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
1.5 Moving kfctl to /usr/bin
This way there is no need to configure PATH.
user@node01:~/Kubeflow$ sudo mv kfctl /usr/bin
[sudo] password for user:
user@node01:~/Kubeflow$ ls
kfctl_k8s_istio.v1.2.0.yaml kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
1.6 Verifying kfctl
user@node01:~/Kubeflow$ which kfctl
/usr/bin/kfctl
user@node01:~/Kubeflow$ ll /usr/bin/ | grep kfctl
-rwxr-xr-x 1 user user 83424955 Nov 21  2020 kfctl*
1.7 Setting Environment Variables
Append these three lines at the end:
export KF_NAME=<a name for your Kubeflow application>
export BASE_DIR=<a base directory of your choice>
export KF_DIR=${BASE_DIR}/${KF_NAME} # where the Kubeflow application is stored
The full file then reads:
user@node01:~/Kubeflow$ sudo vi /etc/profile
# /etc/profile: system-wide .profile file for the Bourne shell (sh(1))
# and Bourne compatible shells (bash(1), ksh(1), ash(1), ...).
if [ "${PS1-}" ]; then
if [ "${BASH-}" ] && [ "$BASH" != "/bin/sh" ]; then
# The file bash.bashrc already sets the default PS1.
# PS1='\h:\w\$ '
if [ -f /etc/bash.bashrc ]; then
. /etc/bash.bashrc
fi
else
if [ "`id -u`" -eq 0 ]; then
PS1='# '
else
PS1='$ '
fi
fi
fi
if [ -d /etc/profile.d ]; then
for i in /etc/profile.d/*.sh; do
if [ -r $i ]; then
. $i
fi
done
unset i
fi
export KF_NAME=Kubeflow1.2.0
export BASE_DIR=~/Kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME} # where the Kubeflow application is stored
Reload the environment variables so they take effect:
user@node01:~/Kubeflow$ source /etc/profile
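Note that /etc/profile is only read by login shells, so any other open terminal needs the explicit source too. A quick sanity check (sketch):
echo "KF_NAME=${KF_NAME} KF_DIR=${KF_DIR}"   # both should print non-empty values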
1.8 Creating the ${KF_DIR} Directory
user@node01:~/Kubeflow$ mkdir -p ${KF_DIR}
Afterwards, the working directory can be seen at the chosen path.
1.9 Installing and Deploying
# move kfctl_k8s_istio.v1.2.0.yaml into the working directory
user@node01:~/Kubeflow$ mv kfctl_k8s_istio.v1.2.0.yaml Kubeflow1.2.0/
user@node01:~/Kubeflow$ ls
kfctl_v1.2.0-0-gbc038f9_linux.tar.gz Kubeflow1.2.0
# enter the working directory
user@node01:~/Kubeflow$ cd Kubeflow1.2.0/
user@node01:~/Kubeflow/Kubeflow1.2.0$ ls
kfctl_k8s_istio.v1.2.0.yaml
# run the install; it errors out
user@node01:~/Kubeflow/Kubeflow1.2.0$ kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
INFO[0000] No name specified in KfDef.Metadata.Name; defaulting to Kubeflow1.2.0 based on location of config file: kfctl_k8s_istio.v1.2.0.yaml. filename="coordinator/coordinator.go:202"
INFO[0000]
****************************************************************
Notice anonymous usage reporting enabled using spartakus
To disable it
If you have already deployed it run the following commands:
cd $(pwd)
kubectl -n ${K8S_NAMESPACE} delete deploy -l app=spartakus
For more info: https://www.kubeflow.org/docs/other-guides/usage-reporting/
****************************************************************
filename="coordinator/coordinator.go:120"
INFO[0000] Creating directory .cache filename="kfconfig/types.go:450"
INFO[0000] Fetching https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz to .cache/manifests filename="kfconfig/types.go:498"
INFO[0004] Updating localPath to .cache/manifests/manifests-1.2.0 filename="kfconfig/types.go:569"
INFO[0004] Fetch succeeded; LocalPath .cache/manifests/manifests-1.2.0 filename="kfconfig/types.go:590"
INFO[0004] Processing application: namespaces filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/namespaces filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: application filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/application filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: istio-stack filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/istio-stack filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: cluster-local-gateway filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/cluster-local-gateway filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: istio filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/istio filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: cert-manager-crds filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/cert-manager-crds filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: cert-manager-kube-system-resources filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/cert-manager-kube-system-resources filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: cert-manager filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/cert-manager filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: add-anonymous-user-filter filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/add-anonymous-user-filter filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: metacontroller filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/metacontroller filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: bootstrap filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/bootstrap filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: spark-operator filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/spark-operator filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: kubeflow-apps filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/kubeflow-apps filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: knative filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/knative filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: kfserving filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/kfserving filename="kustomize/kustomize.go:667"
INFO[0004] Processing application: spartakus filename="kustomize/kustomize.go:569"
INFO[0004] Creating folder kustomize/spartakus filename="kustomize/kustomize.go:667"
INFO[0004] .cache/manifests exists; not resyncing filename="kfconfig/types.go:473"
INFO[0004] namespace: kubeflow filename="utils/k8utils.go:433"
INFO[0004] Creating namespace: kubeflow filename="utils/k8utils.go:438"
Error: failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 400 with message: couldn't create namespace kubeflow Error: Post "http://localhost:8080/api/v1/namespaces": dial tcp [::1]:8080: connect: connection refused
Usage:
kfctl apply -f ${CONFIG} [flags]
Flags:
--context string Optional kubernetes context to use when applying resources. Currently not used by KFDef resources.
-f, --file string Static config file to use. Can be either a local path:
export CONFIG=./kfctl_gcp_iap.yaml
or a URL:
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_istio_dex.v1.2.0.yaml
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_aws.v1.2.0.yaml
export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
kfctl apply -V --file=${CONFIG}
-h, --help help for apply
-V, --verbose verbose output default is false
kfctl exited with error: failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 400 with message: couldn't create namespace kubeflow Error: Post "http://localhost:8080/api/v1/namespaces": dial tcp [::1]:8080: connect: connection refused
user@node01:~/Kubeflow/Kubeflow1.2.0$
Error: the kubeflow namespace could not be created.
One visible change: a kustomize folder was generated.
My guess at the cause: the cluster needs to be up, since the basic requirement is at least one worker node in a cluster, and so far I had been operating on node01 alone.
After bringing the cluster up and re-running, the same error persisted.
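The connection refused on localhost:8080 is the classic symptom of the client finding no kubeconfig: with no configuration, Kubernetes clients fall back to http://localhost:8080. A quick way to check what the client sees (sketch, assuming the usual admin kubeconfig location):
kubectl config current-context        # should print a context, not an error
kubectl cluster-info                  # should show the real API server address
export KUBECONFIG=$HOME/.kube/config  # point kfctl at the kubeconfig if it is unset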
1.9.1 Installing Kubeflow on master (important)
I realized the deployment probably has to happen on master: on node01 it failed at the namespace-creation step, and namespaces are managed by the master. Trying it~
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
...
application.app.k8s.io/knative-serving-crds created
application.app.k8s.io/knative-serving-install created
gateway.networking.istio.io/cluster-local-gateway created
horizontalpodautoscaler.autoscaling/activator created
image.caching.internal.knative.dev/queue-proxy created
servicerole.rbac.istio.io/istio-service-role created
servicerolebinding.rbac.istio.io/istio-service-role-binding created
INFO[0332] Successfully applied application knative filename="kustomize/kustomize.go:291"
INFO[0332] Deploying application kfserving filename="kustomize/kustomize.go:266"
secret/kfserving-webhook-server-secret created
configmap/inferenceservice-config created
customresourcedefinition.apiextensions.k8s.io/inferenceservices.serving.kubeflow.org created
clusterrole.rbac.authorization.k8s.io/kubeflow-kfserving-edit created
clusterrole.rbac.authorization.k8s.io/kfserving-manager-role created
clusterrole.rbac.authorization.k8s.io/kfserving-proxy-role created
clusterrole.rbac.authorization.k8s.io/kubeflow-kfserving-admin created
clusterrole.rbac.authorization.k8s.io/kubeflow-kfserving-view created
clusterrolebinding.rbac.authorization.k8s.io/kfserving-manager-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/kfserving-proxy-rolebinding created
role.rbac.authorization.k8s.io/leader-election-role created
rolebinding.rbac.authorization.k8s.io/leader-election-rolebinding created
service/kfserving-controller-manager-metrics-service created
service/kfserving-controller-manager-service created
service/kfserving-webhook-server-service created
statefulset.apps/kfserving-controller-manager created
mutatingwebhookconfiguration.admissionregistration.k8s.io/inferenceservice.serving.kubeflow.org created
validatingwebhookconfiguration.admissionregistration.k8s.io/inferenceservice.serving.kubeflow.org created
application.app.k8s.io/kfserving created
certificate.cert-manager.io/serving-cert created
issuer.cert-manager.io/selfsigned-issuer created
INFO[0333] Successfully applied application kfserving filename="kustomize/kustomize.go:291"
INFO[0333] Deploying application spartakus filename="kustomize/kustomize.go:266"
configmap/spartakus-config created
serviceaccount/spartakus created
clusterrole.rbac.authorization.k8s.io/spartakus created
clusterrolebinding.rbac.authorization.k8s.io/spartakus created
deployment.apps/spartakus-volunteer created
application.app.k8s.io/spartakus created
INFO[0333] Successfully applied application spartakus filename="kustomize/kustomize.go:291"
INFO[0333] Applied the configuration Successfully! filename="cmd/apply.go:75"
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0#
The run succeeded, though many errors appeared along the way, similar to the screenshot below.
1.10 Checking Component Status
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl -n kubeflow get all
NAME READY STATUS RESTARTS AGE
pod/application-controller-stateful-set-0 0/1 ErrImagePull 0 6m38s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/admission-webhook-service ClusterIP 10.103.120.39 <none> 443/TCP 71s
service/application-controller-service ClusterIP 10.111.227.112 <none> 443/TCP 6m38s
service/argo-ui NodePort 10.101.170.70 <none> 80:30807/TCP 71s
service/cache-server ClusterIP 10.105.186.146 <none> 443/TCP 71s
service/centraldashboard ClusterIP 10.104.172.133 <none> 80/TCP 71s
service/jupyter-web-app-service ClusterIP 10.101.207.115 <none> 80/TCP 71s
service/katib-controller ClusterIP 10.99.87.123 <none> 443/TCP,8080/TCP 71s
service/katib-db-manager ClusterIP 10.100.152.113 <none> 6789/TCP 71s
service/katib-mysql ClusterIP 10.104.225.100 <none> 3306/TCP 71s
service/katib-ui ClusterIP 10.100.126.162 <none> 80/TCP 71s
service/kfserving-controller-manager-metrics-service ClusterIP 10.108.136.216 <none> 8443/TCP 70s
service/kfserving-controller-manager-service ClusterIP 10.96.9.106 <none> 443/TCP 70s
service/kfserving-webhook-server-service ClusterIP 10.106.99.36 <none> 443/TCP 70s
service/kubeflow-pipelines-profile-controller ClusterIP 10.96.157.140 <none> 80/TCP 71s
service/metadata-db ClusterIP 10.107.111.240 <none> 3306/TCP 71s
service/metadata-envoy-service ClusterIP 10.97.104.91 <none> 9090/TCP 71s
service/metadata-grpc-service ClusterIP 10.105.204.174 <none> 8080/TCP 71s
service/minio-service ClusterIP 10.106.7.185 <none> 9000/TCP 71s
service/ml-pipeline ClusterIP 10.96.27.108 <none> 8888/TCP,8887/TCP 71s
service/ml-pipeline-ui ClusterIP 10.102.174.60 <none> 80/TCP 71s
service/ml-pipeline-visualizationserver ClusterIP 10.97.229.74 <none> 8888/TCP 71s
service/mysql ClusterIP 10.108.49.231 <none> 3306/TCP 71s
service/notebook-controller-service ClusterIP 10.110.173.10 <none> 443/TCP 71s
service/profiles-kfam ClusterIP 10.97.143.27 <none> 8081/TCP 71s
service/pytorch-operator ClusterIP 10.108.162.192 <none> 8443/TCP 71s
service/seldon-webhook-service ClusterIP 10.97.17.96 <none> 443/TCP 71s
service/tf-job-operator ClusterIP 10.98.163.53 <none> 8443/TCP 71s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/admission-webhook-deployment 0/1 0 0 71s
deployment.apps/argo-ui 0/1 0 0 71s
deployment.apps/cache-deployer-deployment 0/1 0 0 71s
deployment.apps/cache-server 0/1 0 0 71s
deployment.apps/centraldashboard 0/1 0 0 71s
deployment.apps/jupyter-web-app-deployment 0/1 0 0 71s
deployment.apps/katib-controller 0/1 0 0 71s
deployment.apps/katib-db-manager 0/1 0 0 71s
deployment.apps/katib-mysql 0/1 0 0 71s
deployment.apps/katib-ui 0/1 0 0 71s
deployment.apps/kubeflow-pipelines-profile-controller 0/1 0 0 71s
deployment.apps/metadata-db 0/1 0 0 71s
deployment.apps/metadata-envoy-deployment 0/1 0 0 71s
deployment.apps/metadata-grpc-deployment 0/1 0 0 71s
deployment.apps/metadata-writer 0/1 0 0 71s
deployment.apps/minio 0/1 0 0 71s
deployment.apps/ml-pipeline 0/1 0 0 71s
deployment.apps/ml-pipeline-persistenceagent 0/1 0 0 71s
deployment.apps/ml-pipeline-scheduledworkflow 0/1 0 0 71s
deployment.apps/ml-pipeline-ui 0/1 0 0 71s
deployment.apps/ml-pipeline-viewer-crd 0/1 0 0 71s
deployment.apps/ml-pipeline-visualizationserver 0/1 0 0 71s
deployment.apps/mpi-operator 0/1 0 0 71s
deployment.apps/mxnet-operator 0/1 0 0 71s
deployment.apps/mysql 0/1 0 0 71s
deployment.apps/notebook-controller-deployment 0/1 0 0 71s
deployment.apps/profiles-deployment 0/1 0 0 71s
deployment.apps/pytorch-operator 0/1 0 0 71s
deployment.apps/seldon-controller-manager 0/1 0 0 71s
deployment.apps/spark-operatorsparkoperator 0/1 0 0 73s
deployment.apps/spartakus-volunteer 0/1 0 0 69s
deployment.apps/tf-job-operator 0/1 0 0 71s
deployment.apps/workflow-controller 0/1 0 0 71s
NAME DESIRED CURRENT READY AGE
replicaset.apps/admission-webhook-deployment-5d9ccb5696 1 0 0 71s
replicaset.apps/argo-ui-684bcb587f 1 0 0 71s
replicaset.apps/cache-deployer-deployment-6667847478 1 0 0 71s
replicaset.apps/cache-server-bd9c859db 1 0 0 67s
replicaset.apps/centraldashboard-895c4c768 1 0 0 71s
replicaset.apps/jupyter-web-app-deployment-6588c6f544 1 0 0 71s
replicaset.apps/katib-controller-75c8d47f8c 1 0 0 71s
replicaset.apps/katib-db-manager-6c88c68d79 1 0 0 71s
replicaset.apps/katib-mysql-858f68f588 1 0 0 69s
replicaset.apps/katib-ui-68f59498d4 1 0 0 71s
replicaset.apps/kubeflow-pipelines-profile-controller-69c94df75b 1 0 0 71s
replicaset.apps/metadata-db-757dc9c7b5 1 0 0 71s
replicaset.apps/metadata-envoy-deployment-6ff58757f6 1 0 0 71s
replicaset.apps/metadata-grpc-deployment-76d69f69c8 1 0 0 71s
replicaset.apps/metadata-writer-6d94ffb7df 1 0 0 70s
replicaset.apps/minio-66c9cd74c9 1 0 0 70s
replicaset.apps/ml-pipeline-54989c9946 1 0 0 70s
replicaset.apps/ml-pipeline-persistenceagent-7f6bf7646 1 0 0 70s
replicaset.apps/ml-pipeline-scheduledworkflow-66db7bcf5d 1 0 0 70s
replicaset.apps/ml-pipeline-ui-756b58fb 1 0 0 67s
replicaset.apps/ml-pipeline-viewer-crd-58f59f87db 1 0 0 69s
replicaset.apps/ml-pipeline-visualizationserver-6f9ff4974 1 0 0 69s
replicaset.apps/mpi-operator-77bb5d8f4b 1 0 0 69s
replicaset.apps/mxnet-operator-68b688bb69 1 0 0 69s
replicaset.apps/mysql-7694c6b8b7 1 0 0 68s
replicaset.apps/notebook-controller-deployment-58447d4b4c 1 0 0 68s
replicaset.apps/profiles-deployment-78d4549cbc 1 0 0 68s
replicaset.apps/pytorch-operator-b79799447 1 0 0 68s
replicaset.apps/seldon-controller-manager-5fc5dfc86c 1 0 0 68s
replicaset.apps/spark-operatorsparkoperator-67c6bc65fb 1 0 0 73s
replicaset.apps/spartakus-volunteer-6ddc7b6676 1 0 0 65s
replicaset.apps/tf-job-operator-5c97f4bf7 1 0 0 67s
replicaset.apps/workflow-controller-5c7cc7976d 1 0 0 67s
NAME READY AGE
statefulset.apps/admission-webhook-bootstrap-stateful-set 0/1 73s
statefulset.apps/application-controller-stateful-set 0/1 6m38s
statefulset.apps/kfserving-controller-manager 0/1 70s
statefulset.apps/metacontroller 0/1 73s
The pod fails to pull its image, and none of the Deployments or apps are ready. Normally this points to an external-network problem, but I had set up external access, so what is going on?
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-59b485c4cc-zmj68 1/1 Running 0 38m
cert-manager cert-manager-cainjector-5bb487bcd-t8hsg 1/1 Running 0 38m
cert-manager cert-manager-webhook-74b4bd9bcc-vxnsb 1/1 Running 0 38m
default federated-deployment-655454d67c-kw4nd 0/1 CrashLoopBackOff 1491 39d
default federated-deployment-655454d67c-q54hk 0/1 CrashLoopBackOff 1492 39d
istio-system cluster-local-gateway-84bb595449-pjwjm 0/1 Running 0 38m
istio-system istio-citadel-7f66ddfcfb-zmdwm 0/1 ImagePullBackOff 0 38m
istio-system istio-galley-7976dd55cd-cspbv 0/1 ContainerCreating 0 38m
istio-system istio-ingressgateway-c79f9f6f-cs8f4 0/1 ContainerCreating 0 38m
istio-system istio-nodeagent-4ntfm 0/1 ImagePullBackOff 0 38m
istio-system istio-nodeagent-hpnpc 0/1 ImagePullBackOff 0 27m
istio-system istio-pilot-7bd96d69d9-xmt4f 0/2 ContainerCreating 0 38m
istio-system istio-policy-66b5d9887c-ltgcw 0/2 ContainerCreating 0 38m
istio-system istio-security-post-install-release-1.3-latest-daily-ghgqk 0/1 ImagePullBackOff 0 38m
istio-system istio-sidecar-injector-56b6997f7d-jq5df 0/1 ContainerCreating 0 38m
istio-system istio-telemetry-856f7bcff4-475l7 0/2 ContainerCreating 0 38m
istio-system prometheus-65fdcbc857-d2hhs 0/1 ContainerCreating 0 38m
knative-serving activator-789bcb5644-txkqz 0/1 ImagePullBackOff 0 32m
knative-serving autoscaler-5888bf7697-bprrd 0/1 ImagePullBackOff 0 32m
knative-serving controller-7f646849cd-nfrtz 0/1 ImagePullBackOff 0 32m
knative-serving istio-webhook-7db84bf7bf-l62c9 0/1 ImagePullBackOff 0 32m
knative-serving networking-istio-55d86868c6-8shwd 0/1 ImagePullBackOff 0 32m
knative-serving webhook-579f9448c4-9pcw4 0/1 ImagePullBackOff 0 32m
kube-system coredns-66bff467f8-p8txx 1/1 Running 20 40d
kube-system coredns-66bff467f8-qqrn9 1/1 Running 20 40d
kube-system etcd-master 1/1 Running 4 40d
kube-system kube-apiserver-master 1/1 Running 10337 40d
kube-system kube-controller-manager-master 1/1 Running 23 40d
kube-system kube-flannel-ds-8gb4m 1/1 Running 21 40d
kube-system kube-flannel-ds-tpnlj 1/1 Running 11 40d
kube-system kube-proxy-vrcts 1/1 Running 19 40d
kube-system kube-proxy-w8sv8 1/1 Running 4 40d
kube-system kube-scheduler-master 1/1 Running 24 40d
kubeflow application-controller-stateful-set-0 0/1 ImagePullBackOff 0 38m
None of the newly created Kubeflow-related pods came up.
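For any one of these pods, the Events section of kubectl describe names the exact image that cannot be pulled; one common workaround for an unreachable registry is to pull the image from a mirror and retag it to the name the manifest expects (a sketch; the image and tag are placeholders, modeled on the Aliyun mirror used in section 2):
kubectl -n kubeflow describe pod application-controller-stateful-set-0 | tail -n 20
# pull from a reachable mirror, then retag to the original name (names are illustrative)
docker pull registry.cn-shenzhen.aliyuncs.com/tensorbytes/<image>:<tag>
docker tag registry.cn-shenzhen.aliyuncs.com/tensorbytes/<image>:<tag> <original-registry>/<image>:<tag>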
2 Install Kubeflow with Aliyun Local Mirror
2.1 Deleting the Previously Deployed Kubeflow 1.2.0 Components
kfctl delete -V -f kfctl_k8s_istio.v1.2.0.yaml
Deletion takes over half an hour.
And it errors out without cleaning everything up:
Error: couldn't delete KfApp: (kubeflow.error): Code 500 with message: kfApp Delete failed for kustomize: (kubeflow.error): Code 500 with message: error deleting kustomize manifests: [error evaluating kustomization manifest for knative: Timed out waiting for resource /knative-serving to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for cert-manager: Timed out waiting for resource /cert-manager to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for cluster-local-gateway: Timed out waiting for resource /istio-system to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for istio-stack: Timed out waiting for resource /istio-system to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for namespaces: Timed out waiting for resource /cert-manager to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for namespaces: Timed out waiting for resource /kubeflow to be deleted. Error deleted resource is not cleaned up yet]
Usage:
kfctl delete [flags]
Flags:
--delete_storage Set if you want to delete app's storage cluster used for mlpipeline.
-f, --file string The local config file of KfDef.
--force-deletion force-deletion output default is false
-h, --help help for delete
-V, --verbose verbose output default is false
This can seemingly be ignored; it appears to be because node01 had not joined the cluster, while master had scheduled the Kubeflow components onto node01 during installation.
Once node01 was brought back up later, the pods stuck in Terminating were finally removed.
As shown:
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-66bff467f8-p8txx 1/1 Running 20 40d
kube-system coredns-66bff467f8-qqrn9 1/1 Running 20 40d
kube-system etcd-master 1/1 Running 4 40d
kube-system kube-apiserver-master 1/1 Running 10533 40d
kube-system kube-controller-manager-master 1/1 Running 24 40d
kube-system kube-flannel-ds-8gb4m 1/1 Running 22 40d
kube-system kube-flannel-ds-tpnlj 1/1 Running 11 40d
kube-system kube-proxy-vrcts 1/1 Running 20 40d
kube-system kube-proxy-w8sv8 1/1 Running 4 40d
kube-system kube-scheduler-master 1/1 Running 25 40d
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl -n kubeflow get all
No resources found in kubeflow namespace.
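For reference: if a namespace ever stays stuck in Terminating (as cert-manager, istio-system, and kubeflow threatened to above), clearing its finalizers forces the deletion through. A well-known last-resort sketch, assuming jq is installed:
kubectl get namespace kubeflow -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/kubeflow/finalize" -f -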
2.2 Installing per the Tutorial
# create a dedicated working folder
root@master:/home/hqc/Kubeflow# mkdir Kubeflow1.3
root@master:/home/hqc/Kubeflow# cd Kubeflow1.3/
# clone the source project
root@master:/home/hqc/Kubeflow/Kubeflow1.3# git clone https://github.com/shikanon/kubeflow-manifests.git
Cloning into 'kubeflow-manifests'...
remote: Enumerating objects: 552, done.
remote: Counting objects: 100% (552/552), done.
remote: Compressing objects: 100% (358/358), done.
remote: Total 552 (delta 201), reused 506 (delta 171), pack-reused 0
Receiving objects: 100% (552/552), 571.84 KiB | 316.00 KiB/s, done.
Resolving deltas: 100% (201/201), done.
# enter the folder
root@master:/home/hqc/Kubeflow/Kubeflow1.3# cd kubeflow-manifests
# run the one-click installer, which applies the YAML files one by one
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# python install.py
kubectl apply -f ./manifest1.3/001-cert-manager-cert-manager-kube-system-resources-base.yaml
b'role.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created\nrole.rbac.authorization.k8s.io/cert-manager:leaderelection created\nrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created\nrolebinding.rbac.authorization.k8s.io/cert-manager-webhook:webhook-authentication-reader created\nrolebinding.rbac.authorization.k8s.io/cert-manager:leaderelection created\nconfigmap/cert-manager-kube-params-parameters created\n'
kubectl apply -f ./manifest1.3/002-cert-manager-cert-manager-crds-base.yaml
b'customresourcedefinition.apiextensions.k8s.io/certificaterequests.cert-manager.io created\ncustomresourcedefinition.apiextensions.k8s.io/certificates.cert-manager.io created\ncustomresourcedefinition.apiextensions.k8s.io/challenges.acme.cert-manager.io created\ncustomresourcedefinition.apiextensions.k8s.io/clusterissuers.cert-manager.io created\ncustomresourcedefinition.apiextensions.k8s.io/issuers.cert-manager.io created\ncustomresourcedefinition.apiextensions.k8s.io/orders.acme.cert-manager.io created\n'
kubectl apply -f ./manifest1.3/003-cert-manager-overlays-self-signed.yaml
b'namespace/cert-manager created\nserviceaccount/cert-manager created\nserviceaccount/cert-manager-cainjector created\nserviceaccount/cert-manager-webhook created\nclusterrole.rbac.authorization.k8s.io/cert-manager-edit created\nclusterrole.rbac.authorization.k8s.io/cert-manager-view created\nclusterrole.rbac.authorization.k8s.io/cert-manager-webhook:webhook-requester created\nclusterrole.rbac.authorization.k8s.io/cert-manager-cainjector created\nclusterrole.rbac.authorization.k8s.io/cert-manager-controller-certificates created\nclusterrole.rbac.authorization.k8s.io/cert-manager-controller-challenges created\nclusterrole.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers created\nclusterrole.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim created\nclusterrole.rbac.authorization.k8s.io/cert-manager-controller-issuers created\nclusterrole.rbac.authorization.k8s.io/cert-manager-controller-orders created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-certificates created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-challenges created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-issuers created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-orders created\nclusterrolebinding.rbac.authorization.k8s.io/cert-manager-webhook:auth-delegator created\nconfigmap/cert-manager-parameters created\nservice/cert-manager created\nservice/cert-manager-webhook created\ndeployment.apps/cert-manager created\ndeployment.apps/cert-manager-cainjector created\ndeployment.apps/cert-manager-webhook created\napiservice.apiregistration.k8s.io/v1beta1.webhook.cert-manager.io created\nclusterissuer.cert-manager.io/kubeflow-self-signing-issuer created\nmutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created\nvalidatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created\n'
...
Not a single error appeared during the deployment. Amazing!!!
(It turned out there were errors after all; without color highlighting they were easy to miss.)
...
kubectl apply -f ./manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml
error: unable to recognize "./manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml": no matches for kind "CompositeController" in version "metacontroller.k8s.io/v1alpha1"
...
...
Error from server (NotFound): error when deleting "./patch/data.yaml": deployments.apps "minio" not found
b''
b'deployment.apps/minio created\n'
b'envoyfilter.networking.istio.io "authn-filter" deleted\n'
b'envoyfilter.networking.istio.io/authn-filter created\n'
Error from server (NotFound): error when deleting "./patch/istio-ingressgateway.yaml": deployments.apps "istio-ingressgateway" not found
b''
b'deployment.apps/istio-ingressgateway created\n'
Error from server (NotFound): error when deleting "./patch/istiod.yaml": deployments.apps "istiod" not found
b'configmap "istio-sidecar-injector" deleted\n'
b'deployment.apps/istiod created\nconfigmap/istio-sidecar-injector created\n'
b'deployment.apps "jupyter-web-app-deployment" deleted\n'
b'deployment.apps/jupyter-web-app-deployment created\n'
b'image.caching.internal.knative.dev "queue-proxy" deleted\nconfigmap "config-deployment" deleted\nconfigmap "inferenceservice-config" deleted\n'
b'image.caching.internal.knative.dev/queue-proxy created\nconfigmap/config-deployment created\nconfigmap/inferenceservice-config created\n'
Error from server (NotFound): error when deleting "./patch/pipeline-env-platform-agnostic-multi-user.yaml": configmaps "kubeflow-pipelines-profile-controller-code-c2cd68d9k4" not found
Error from server (NotFound): error when deleting "./patch/pipeline-env-platform-agnostic-multi-user.yaml": configmaps "pipeline-install-config" not found
Error from server (NotFound): error when deleting "./patch/pipeline-env-platform-agnostic-multi-user.yaml": deployments.apps "workflow-controller" not found
Error from server (NotFound): error when deleting "./patch/pipeline-env-platform-agnostic-multi-user.yaml": deployments.apps "kubeflow-pipelines-profile-controller" not found
b''
b'configmap/kubeflow-pipelines-profile-controller-code-c2cd68d9k4 created\nconfigmap/pipeline-install-config created\ndeployment.apps/workflow-controller created\ndeployment.apps/kubeflow-pipelines-profile-controller created\n'
b'deployment.apps "tensorboards-web-app-deployment" deleted\n'
b'deployment.apps/tensorboards-web-app-deployment created\n'
b'deployment.apps "volumes-web-app-deployment" deleted\n'
b'deployment.apps/volumes-web-app-deployment created\n'
Error from server (NotFound): error when deleting "./patch/workflow-controller.yaml": configmaps "workflow-controller-configmap" not found
Error from server (NotFound): error when deleting "./patch/workflow-controller.yaml": deployments.apps "cache-server" not found
b'deployment.apps "workflow-controller" deleted\n'
b'configmap/workflow-controller-configmap created\ndeployment.apps/workflow-controller created\ndeployment.apps/cache-server created\n'
As shown below:
The main errors: deployments.apps and configmaps not found when the YAML files are deleted.
A closer look also shows five remaining problem states: Pending, Not Ready, ContainerCreating, CrashLoopBackOff, and CreateContainerConfigError.
2.3 Troubleshooting
2.3.1 Pending
From past experience, a Pending pod is likely caused by a missing pv or pvc, so that is the angle to start from; see the quick check below.
Generally speaking, Pending means the Pod is stuck unscheduled: no physical node can be found to run it.
# inspect the pod in detail
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl describe pods authservice-0 -n istio-system
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 93s (x49 over 66m) default-scheduler running "VolumeBinding" filter plugin for pod "authservice-0": pod has unbound immediate PersistentVolumeClaims
# indeed related to persistent volumes
# check the logs
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl logs authservice-0 -n istio-system
# no log output; a Pending pod has never started a container, so there is nothing to log
2.3.2 Not Ready
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl describe pods cluster-local-gateway-d8688cfdd-m4znc -n istio-system
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 78s (x4588 over 4h53m) kubelet, node01 Readiness probe failed: HTTP probe failed with statuscode: 503
# cause unknown for now
2.3.3 ContainerCreating
After re-running, a new pod appeared that stays in ContainerCreating; inspecting it:
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl describe pods cluster-local-gateway-54568d47c5-2jk7s -n istio-system
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60m default-scheduler Successfully assigned istio-system/cluster-local-gateway-54568d47c5-2jk7s to node01
Warning FailedMount 58m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[config-volume podinfo cluster-local-gateway-service-account-token-6f4dv istio-envoy ingressgateway-ca-certs istio-token istio-data ingressgateway-certs istiod-ca-cert]: timed out waiting for the condition
Warning FailedMount 55m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[ingressgateway-ca-certs cluster-local-gateway-service-account-token-6f4dv istio-envoy istiod-ca-cert istio-token ingressgateway-certs config-volume istio-data podinfo]: timed out waiting for the condition
Warning FailedMount 53m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[istio-envoy istio-token podinfo cluster-local-gateway-service-account-token-6f4dv ingressgateway-certs ingressgateway-ca-certs config-volume istiod-ca-cert istio-data]: timed out waiting for the condition
Warning FailedMount 51m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[ingressgateway-certs cluster-local-gateway-service-account-token-6f4dv config-volume istio-token istio-data istio-envoy podinfo ingressgateway-ca-certs istiod-ca-cert]: timed out waiting for the condition
Warning FailedMount 49m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[istio-data podinfo ingressgateway-certs istiod-ca-cert ingressgateway-ca-certs cluster-local-gateway-service-account-token-6f4dv config-volume istio-token istio-envoy]: timed out waiting for the condition
Warning FailedMount 46m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[config-volume istio-data istiod-ca-cert istio-token ingressgateway-certs ingressgateway-ca-certs istio-envoy podinfo cluster-local-gateway-service-account-token-6f4dv]: timed out waiting for the condition
Warning FailedMount 44m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[istiod-ca-cert istio-envoy cluster-local-gateway-service-account-token-6f4dv config-volume istio-token istio-data podinfo ingressgateway-certs ingressgateway-ca-certs]: timed out waiting for the condition
Warning FailedMount 42m kubelet, node01 Unable to attach or mount volumes: unmounted volumes=[istio-token], unattached volumes=[istio-envoy istio-token cluster-local-gateway-service-account-token-6f4dv istiod-ca-cert config-volume ingressgateway-certs istio-data podinfo ingressgateway-ca-certs]: timed out waiting for the condition
Warning FailedMount 9m12s (x19 over 40m) kubelet, node01 (combined from similar events): MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the API server does not have TokenRequest endpoints enabled
Warning FailedMount 5m8s (x29 over 60m) kubelet, node01 MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the API server does not have TokenRequest endpoints enabled
# also volume-related
Comparing with before, there is an extra identical pod, presumably produced by the reinstall, so delete it:
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl get deployment -n istio-system
NAME READY UP-TO-DATE AVAILABLE AGE
cluster-local-gateway 0/1 1 0 5h25m
istio-ingressgateway 0/1 1 0 5h25m
istiod 1/1 1 1 5h25m
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl delete deployment cluster-local-gateway -n istio-system
deployment.apps "cluster-local-gateway" deleted
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl get deployment -n istio-system
NAME READY UP-TO-DATE AVAILABLE AGE
istio-ingressgateway 0/1 1 0 5h25m
istiod 1/1 1 1 5h25m
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-cainjector-846b7c9f8c-7vmxf 1/1 Running 33 5h31m
cert-manager cert-manager-fbc979d45-rws9g 1/1 Running 1 5h31m
cert-manager cert-manager-webhook-67956cb44b-hz6c4 1/1 Running 0 5h31m
istio-system authservice-0 0/1 Pending 0 5h30m
istio-system istio-ingressgateway-84f6567479-4z9q4 0/1 Running 0 5h25m
istio-system istiod-5d6d848d84-8fwg8 1/1 Running 0 5h25m
knative-eventing broker-controller-d675f7d9f-hb6bg 1/1 Running 0 5h29m
2.3.4 CrashLoopBackOff
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl describe pods katib-db-manager-755464ffcf-f4wl8 -n kubeflow
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 9m24s (x184 over 5h10m) kubelet, node01 Readiness probe failed: timeout: failed to connect service ":6789" within 1s
Warning BackOff 4m24s (x630 over 5h8m) kubelet, node01 Back-off restarting failed container
2.3.5 CreateContainerConfigError
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl describe pods kubeflow-pipelines-profile-controller-65c8c9dc9c-2g6pm -n kubeflow
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 3m54s (x749 over 5h5m) kubelet, node01 Container image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/python:3.7-3a781" already present on machine
# puzzling at first: the image is already on the machine, yet the pod still fails
# in fact CreateContainerConfigError concerns the container's config (typically a missing ConfigMap or Secret), not the image
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl describe pods minio-6f4c68d54f-q7mnl -n kubeflow
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 80s (x784 over 5h10m) kubelet, node01 Container image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/ml-pipeline-minio:RELEASE.2019-08-14T20-37-41Z-license-compliance-290a7" already present on machine
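The missing piece is recorded in the container's waiting state and can be read out directly (sketch, reusing the pod name above):
kubectl -n kubeflow get pod minio-6f4c68d54f-q7mnl \
  -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}{"\n"}{.status.containerStatuses[*].state.waiting.message}{"\n"}'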
2.3.6 Problem Summary
The problematic pods are: authservice-0, cluster-local-gateway, istio-ingressgateway, katib-db-manager, katib-mysql, kubeflow-pipelines-profile-controller, and minio.
# check the related components
root@master:/home/hqc/Kubeflow/Kubeflow1.2.0# kubectl -n kubeflow get all
The following components also deserve extra attention: deployment.apps/cache-server, deployment.apps/katib-db-manager, deployment.apps/katib-mysql, deployment.apps/kubeflow-pipelines-profile-controller, deployment.apps/minio, deployment.apps/workflow-controller, replicaset.apps/cache-server, replicaset.apps/katib-db-manager, replicaset.apps/katib-mysql, replicaset.apps/kubeflow-pipelines-profile-controller, replicaset.apps/minio, and replicaset.apps/workflow-controller-7b8f56f6c.
2.3.7 Partial Fixes (the UI becomes reachable, but some components remain abnormal)
To be able to log in to the UI, at minimum everything in the istio-system and knative-eventing namespaces must be Running.
The situation so far: istio-system/authservice-0 and kubeflow/katib-mysql were stuck in Pending, which is usually tied to missing pv and pvc objects.
# create folders for the two components
root@master:/home/hqc/Kubeflow/Kubeflow1.3# mkdir pv1
root@master:/home/hqc/Kubeflow/Kubeflow1.3# mkdir pv2
# create the pv.yaml file
root@master:/home/hqc/Kubeflow/Kubeflow1.3# vim pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-authservice
spec:
  capacity:
    storage: 25Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/home/hqc/Kubeflow/Kubeflow1.3/pv1"  # must match the directory created above
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-katib-mysql
spec:
  capacity:
    storage: 25Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/home/hqc/Kubeflow/Kubeflow1.3/pv2"  # must match the directory created above
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl apply -f pv.yaml
persistentvolume/pv-authservice created
persistentvolume/pv-katib-mysql created
# check
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl get pvc --all-namespaces -o wide
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
istio-system authservice-pvc Bound pvc-c3484c8f-169c-41fb-80bf-5e58935969fa 10Gi RWO local-path 40m Filesystem
kubeflow katib-mysql Bound pv-katib-mysql 25Gi RWO 16h Filesystem
running!
2.3.8 Logging In
# check the gateway service
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl get svc/istio-ingressgateway -n istio-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway NodePort 10.108.100.5 <none> 15021:31201/TCP,80:30000/TCP,443:32407/TCP,31400:30740/TCP,15443:31863/TCP 16h
The second mapping, 80:30000/TCP, shows that the NodePort is 30000, so the login page can be reached at localhost:30000.
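If the NodePort is not reachable (behind a firewall, for instance), a port-forward through the API server works as an alternative (sketch):
kubectl -n istio-system port-forward svc/istio-ingressgateway 8080:80
# then browse to http://localhost:8080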
Experimenting shows that the IP of any node in the k8s cluster works as well:
Enter the username and password to log in; the credentials can be changed via patch/auth.yaml. The default username is admin@example.com and the password is password.
After logging in, the Kubeflow dashboard appears:
2.4 Redeploying
The earlier deployment steps got a bit messy; login works, but plenty of abnormal leftovers remain:
To keep them from interfering with later deployments (problems would only get messier), I decided to redeploy~
2.4.1 Deleting All Related Components
## database-patch
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f database-patch/mysql-persistent-storage.yaml
deployment.apps "mysql" deleted
## local-path
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f local-path/local-path-storage.yaml
namespace "local-path-storage" deleted
serviceaccount "local-path-provisioner-service-account" deleted
clusterrole.rbac.authorization.k8s.io "local-path-provisioner-role" deleted
clusterrolebinding.rbac.authorization.k8s.io "local-path-provisioner-bind" deleted
deployment.apps "local-path-provisioner" deleted
storageclass.storage.k8s.io "local-path" deleted
configmap "local-path-config" deleted
## manifest1.3
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f manifest1.3/001-cert-manager-cert-manager-kube-system-resources-base.yaml
role.rbac.authorization.k8s.io "cert-manager-cainjector:leaderelection" deleted
role.rbac.authorization.k8s.io "cert-manager:leaderelection" deleted
rolebinding.rbac.authorization.k8s.io "cert-manager-cainjector:leaderelection" deleted
rolebinding.rbac.authorization.k8s.io "cert-manager-webhook:webhook-authentication-reader" deleted
rolebinding.rbac.authorization.k8s.io "cert-manager:leaderelection" deleted
configmap "cert-manager-kube-params-parameters" deleted
...
## patch
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/auth.yaml
profile.kubeflow.org "kubeflow-user-example-com" deleted
Error from server (NotFound): error when deleting "patch/auth.yaml": configmaps "dex" not found
Error from server (NotFound): error when deleting "patch/auth.yaml": deployments.apps "dex" not found
Error from server (NotFound): error when deleting "patch/auth.yaml": configmaps "default-install-config-9h2h2b6hbk" not found
...
## pv.yaml
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl delete -f pv.yaml
persistentvolume "pv-authservice" deleted
persistentvolume "pv-katib-mysql" deleted
2.4.2 Creating the PVs
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl apply -f pv.yaml
persistentvolume/pv-authservice created
persistentvolume/pv-katib-mysql created
# check the PVs
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv-authservice 25Gi RWO Retain Available 3m19s
pv-katib-mysql 25Gi RWO Retain Available 3m19s
2.4.3 Also Volume-Related (apparently this sets up the StorageClass)
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f local-path/local-path-storage.yaml
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
2.4.4 Deploying manifest1.3
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f manifest1.3/001-cert-manager-cert-manager-kube-system-resources-base.yaml
role.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created
role.rbac.authorization.k8s.io/cert-manager:leaderelection created
rolebinding.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created
rolebinding.rbac.authorization.k8s.io/cert-manager-webhook:webhook-authentication-reader created
rolebinding.rbac.authorization.k8s.io/cert-manager:leaderelection created
configmap/cert-manager-kube-params-parameters created
......
# the 17th file still errors out; pipeline-related
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml
error: unable to recognize "manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml": no matches for kind "CompositeController" in version "metacontroller.k8s.io/v1alpha1"
......
The same error as before.
At this point, things look like this:
2.4.5 Applying Patches
Some of the patches change things that require the pods to restart, so each file has to be deleted first and then applied; the whole round can also be scripted, as sketched below.
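A minimal loop over everything in patch/ (a sketch; --ignore-not-found suppresses the NotFound noise seen earlier):
for f in patch/*.yaml; do
  kubectl delete -f "$f" --ignore-not-found
  kubectl apply -f "$f"
done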
# delete
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/auth.yaml
configmap "dex" deleted
deployment.apps "dex" deleted
configmap "default-install-config-9h2h2b6hbk" deleted
profile.kubeflow.org "kubeflow-user-example-com" deleted
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/auth.yaml
configmap/dex created
deployment.apps/dex created
configmap/default-install-config-9h2h2b6hbk created
profile.kubeflow.org/kubeflow-user-example-com unchanged
# delete
kubectl delete -f patch/cluster-local-gateway.yaml
deployment.apps "cluster-local-gateway" deleted
# apply
kubectl apply -f patch/cluster-local-gateway.yaml
deployment.apps/cluster-local-gateway created
# delete
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/data.yaml
Error from server (NotFound): error when deleting "patch/data.yaml": deployments.apps "minio" not found
# it did not exist before
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/data.yaml
deployment.apps/minio created
# delete
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/envoy-filter.yaml
envoyfilter.networking.istio.io "authn-filter" deleted
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/envoy-filter.yaml
envoyfilter.networking.istio.io/authn-filter created
# delete
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/istiod.yaml
deployment.apps "istiod" deleted
configmap "istio-sidecar-injector" deleted
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/istiod.yaml
deployment.apps/istiod created
configmap/istio-sidecar-injector created
At this point the istiod component turns Running.
Continuing~
# delete
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/istio-ingressgateway.yaml
deployment.apps "istio-ingressgateway" deleted
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/istio-ingressgateway.yaml
deployment.apps/istio-ingressgateway created
At this point both cluster-local-gateway and istio-ingressgateway turn Running, and the login page becomes reachable, though the full UI is still missing.
Continuing~
# delete
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/jupyter-web-app.yaml
deployment.apps "jupyter-web-app-deployment" deleted
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/jupyter-web-app.yaml
deployment.apps/jupyter-web-app-deployment created
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/kfserving.yaml
image.caching.internal.knative.dev "queue-proxy" deleted
configmap "config-deployment" deleted
configmap "inferenceservice-config" deleted
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/kfserving.yaml
image.caching.internal.knative.dev/queue-proxy created
configmap/config-deployment created
configmap/inferenceservice-config created
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/pipeline-env-platform-agnostic-multi-user.yaml
Error from server (NotFound): error when deleting "patch/pipeline-env-platform-agnostic-multi-user.yaml": configmaps "kubeflow-pipelines-profile-controller-code-c2cd68d9k4" not found
Error from server (NotFound): error when deleting "patch/pipeline-env-platform-agnostic-multi-user.yaml": configmaps "pipeline-install-config" not found
Error from server (NotFound): error when deleting "patch/pipeline-env-platform-agnostic-multi-user.yaml": deployments.apps "workflow-controller" not found
Error from server (NotFound): error when deleting "patch/pipeline-env-platform-agnostic-multi-user.yaml": deployments.apps "kubeflow-pipelines-profile-controller" not found
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/pipeline-env-platform-agnostic-multi-user.yaml
configmap/kubeflow-pipelines-profile-controller-code-c2cd68d9k4 created
configmap/pipeline-install-config created
deployment.apps/workflow-controller created
deployment.apps/kubeflow-pipelines-profile-controller created
# CreateContainerConfigError
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/tensorboard.yaml
deployment.apps "tensorboards-web-app-deployment" deleted
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/tensorboard.yaml
deployment.apps/tensorboards-web-app-deployment created
# running
At this step a lot of components have come up, and they are all Running!
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/volumes-web-app.yaml
deployment.apps "volumes-web-app-deployment" deleted
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/volumes-web-app.yaml
deployment.apps/volumes-web-app-deployment created
# running
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/workflow-controller.yaml
deployment.apps "workflow-controller" deleted
Error from server (NotFound): error when deleting "patch/workflow-controller.yaml": configmaps "workflow-controller-configmap" not found
Error from server (NotFound): error when deleting "patch/workflow-controller.yaml": deployments.apps "cache-server" not found
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/workflow-controller.yaml
configmap/workflow-controller-configmap created
deployment.apps/workflow-controller created
deployment.apps/cache-server created
# two or three minutes later, the components deployed in this step are still not ready
2.4.6 Checking the UI
Logging in to the UI works, but a few problems remain:
- No Namespace: a fix is described in a reference, though it did not seem to work for me; the problem went away after reinstalling. The second problem remained.
- Invalid Page: this is related to pipeline, so it is presumably caused by the 017 YAML file failing to deploy earlier, and possibly also by the later workflow components.
The error message is:
# the 17th file still errors out; pipeline-related
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml
error: unable to recognize "manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml": no matches for kind "CompositeController" in version "metacontroller.k8s.io/v1alpha1"
Searching around suggests the resource file's API version definition is outdated, and that changing v1beta1 to v1 should be enough. Trying~
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml
error: error validating "manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml": error validating data: [ValidationError(CustomResourceDefinition.spec): unknown field "version" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): missing required field "versions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec]; if you choose to ignore these errors, turn validation off with --validate=false
Not solved!…
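The validation errors are actually expected: apiextensions.k8s.io/v1 dropped spec.version in favor of a spec.versions list, and each listed version must carry a schema, so renaming v1beta1 to v1 alone cannot work; the whole CRD stanza would have to be rewritten. An untested sketch of what that might look like (the schema simply preserves unknown fields):
kubectl apply -f - <<'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: compositecontrollers.metacontroller.k8s.io
spec:
  group: metacontroller.k8s.io
  names:
    kind: CompositeController
    plural: compositecontrollers
    singular: compositecontroller
  scope: Cluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
EOF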
2.4.7 Solving the Problem
This resolved the undeployable 017.yaml and the leftover patch issues, and with them the UI problems.
From the earlier error, no matches for kind "CompositeController" in version "metacontroller.k8s.io/v1alpha1", the stanzas involving CompositeController must be the problem.
Repeatedly tweaking their apiversions got nowhere, so I decided to simply comment out the two CompositeController stanzas and try; the stanzas are independent of one another, so commenting them out does not affect the other components.
# comment out the stanzas
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# vim manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml
# apply
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f manifest1.3/017-pipeline-env-platform-agnostic-multi-user.yaml
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/controllerrevisions.metacontroller.k8s.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/decoratorcontrollers.metacontroller.k8s.io created
customresourcedefinition.apiextensions.k8s.io/scheduledworkflows.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/viewers.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/workfloweventbindings.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/kubeflow-pipelines-cache created
serviceaccount/kubeflow-pipelines-cache-deployer-sa created
serviceaccount/kubeflow-pipelines-container-builder created
serviceaccount/kubeflow-pipelines-metadata-writer created
serviceaccount/kubeflow-pipelines-viewer created
serviceaccount/meta-controller-service created
serviceaccount/metadata-grpc-server created
serviceaccount/ml-pipeline created
serviceaccount/ml-pipeline-persistenceagent created
serviceaccount/ml-pipeline-scheduledworkflow created
serviceaccount/ml-pipeline-ui created
serviceaccount/ml-pipeline-viewer-crd-service-account created
serviceaccount/ml-pipeline-visualizationserver created
serviceaccount/mysql created
serviceaccount/pipeline-runner created
role.rbac.authorization.k8s.io/argo-role created
role.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-role created
role.rbac.authorization.k8s.io/kubeflow-pipelines-cache-role created
role.rbac.authorization.k8s.io/kubeflow-pipelines-metadata-writer-role created
role.rbac.authorization.k8s.io/ml-pipeline created
role.rbac.authorization.k8s.io/ml-pipeline-persistenceagent-role created
role.rbac.authorization.k8s.io/ml-pipeline-scheduledworkflow-role created
role.rbac.authorization.k8s.io/ml-pipeline-ui created
role.rbac.authorization.k8s.io/ml-pipeline-viewer-controller-role created
role.rbac.authorization.k8s.io/pipeline-runner created
clusterrole.rbac.authorization.k8s.io/aggregate-to-kubeflow-pipelines-edit created
clusterrole.rbac.authorization.k8s.io/aggregate-to-kubeflow-pipelines-view created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-admin created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-edit created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-view created
clusterrole.rbac.authorization.k8s.io/argo-cluster-role created
clusterrole.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-clusterrole created
clusterrole.rbac.authorization.k8s.io/kubeflow-pipelines-cache-role created
clusterrole.rbac.authorization.k8s.io/kubeflow-pipelines-edit created
clusterrole.rbac.authorization.k8s.io/kubeflow-pipelines-metadata-writer-role created
clusterrole.rbac.authorization.k8s.io/kubeflow-pipelines-view created
clusterrole.rbac.authorization.k8s.io/ml-pipeline-persistenceagent-role created
clusterrole.rbac.authorization.k8s.io/ml-pipeline-scheduledworkflow-role created
clusterrole.rbac.authorization.k8s.io/ml-pipeline-ui created
clusterrole.rbac.authorization.k8s.io/ml-pipeline-viewer-controller-role created
clusterrole.rbac.authorization.k8s.io/ml-pipeline created
rolebinding.rbac.authorization.k8s.io/argo-binding created
rolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-cache-binding created
rolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-rolebinding created
rolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-metadata-writer-binding created
rolebinding.rbac.authorization.k8s.io/ml-pipeline created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-persistenceagent-binding created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-scheduledworkflow-binding created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-ui created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-viewer-crd-binding created
rolebinding.rbac.authorization.k8s.io/pipeline-runner-binding created
clusterrolebinding.rbac.authorization.k8s.io/argo-binding created
clusterrolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-cache-binding created
clusterrolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-clusterrolebinding created
clusterrolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-metadata-writer-binding created
clusterrolebinding.rbac.authorization.k8s.io/meta-controller-cluster-role-binding created
clusterrolebinding.rbac.authorization.k8s.io/ml-pipeline-persistenceagent-binding created
clusterrolebinding.rbac.authorization.k8s.io/ml-pipeline-scheduledworkflow-binding created
clusterrolebinding.rbac.authorization.k8s.io/ml-pipeline-ui created
clusterrolebinding.rbac.authorization.k8s.io/ml-pipeline-viewer-crd-binding created
clusterrolebinding.rbac.authorization.k8s.io/ml-pipeline created
configmap/kubeflow-pipelines-profile-controller-code-c2cd68d9k4 created
configmap/kubeflow-pipelines-profile-controller-env-5252m69c4c created
configmap/metadata-grpc-configmap created
configmap/ml-pipeline-ui-configmap created
configmap/pipeline-api-server-config-dc9hkg52h6 created
configmap/pipeline-install-config created
configmap/workflow-controller-configmap created
secret/mlpipeline-minio-artifact created
secret/mysql-secret created
service/cache-server created
service/kubeflow-pipelines-profile-controller created
service/metadata-envoy-service created
service/metadata-grpc-service created
service/minio-service created
service/ml-pipeline created
service/ml-pipeline-ui created
service/ml-pipeline-visualizationserver created
service/mysql created
service/workflow-controller-metrics created
persistentvolumeclaim/minio-pvc created
persistentvolumeclaim/mysql-pv-claim created
deployment.apps/cache-deployer-deployment created
deployment.apps/cache-server created
deployment.apps/kubeflow-pipelines-profile-controller created
deployment.apps/metadata-envoy-deployment created
deployment.apps/metadata-grpc-deployment created
deployment.apps/metadata-writer created
deployment.apps/minio created
deployment.apps/ml-pipeline created
deployment.apps/ml-pipeline-persistenceagent created
deployment.apps/ml-pipeline-scheduledworkflow created
deployment.apps/ml-pipeline-ui created
deployment.apps/ml-pipeline-viewer-crd created
deployment.apps/ml-pipeline-visualizationserver created
deployment.apps/mysql created
deployment.apps/workflow-controller created
statefulset.apps/metacontroller created
destinationrule.networking.istio.io/ml-pipeline created
destinationrule.networking.istio.io/ml-pipeline-minio created
destinationrule.networking.istio.io/ml-pipeline-mysql created
destinationrule.networking.istio.io/ml-pipeline-ui created
destinationrule.networking.istio.io/ml-pipeline-visualizationserver created
virtualservice.networking.istio.io/metadata-grpc created
virtualservice.networking.istio.io/ml-pipeline-ui created
authorizationpolicy.security.istio.io/metadata-grpc-service created
authorizationpolicy.security.istio.io/minio-service created
authorizationpolicy.security.istio.io/ml-pipeline created
authorizationpolicy.security.istio.io/ml-pipeline-ui created
authorizationpolicy.security.istio.io/ml-pipeline-visualizationserver created
authorizationpolicy.security.istio.io/mysql created
authorizationpolicy.security.istio.io/service-cache-server created
# Many new components were created
Most components come up to Running after a while (be patient and allow 10+ minutes), but at this point many components are still abnormal.
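While waiting, the rollout can be monitored with a watch; a minimal sketch using standard kubectl (-w streams status changes as they happen):
# Stream pod status changes in the kubeflow namespace until everything is Running
kubectl get pod -n kubeflow -w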
This is where the patches come into play!
# data.yaml
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/data.yaml
deployment.apps "minio" deleted
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/data.yaml
deployment.apps/minio created
# pipeline
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl delete -f patch/pipeline-env-platform-agnostic-multi-user.yaml
configmap "kubeflow-pipelines-profile-controller-code-c2cd68d9k4" deleted
configmap "pipeline-install-config" deleted
deployment.apps "workflow-controller" deleted
deployment.apps "kubeflow-pipelines-profile-controller" deleted
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/pipeline-env-platform-agnostic-multi-user.yaml
configmap/kubeflow-pipelines-profile-controller-code-c2cd68d9k4 created
configmap/pipeline-install-config created
deployment.apps/workflow-controller created
deployment.apps/kubeflow-pipelines-profile-controller created
# workflow-controller
root@master:/home/hqc/Kubeflow/Kubeflow1.3/kubeflow-manifests# kubectl apply -f patch/workflow-controller.yaml
configmap/workflow-controller-configmap configured
deployment.apps/workflow-controller unchanged
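As an aside, each delete-then-apply pair above can be collapsed into one command; a sketch using kubectl's force replace, which is equivalent to delete + create:
# Recreate the patched resources from each file in a single step
kubectl replace --force -f patch/data.yaml
kubectl replace --force -f patch/pipeline-env-platform-agnostic-multi-user.yaml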
2.4.8 Checking the Results
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
auth dex-bb655f999-nw98h 1/1 Running 1 2d1h
cert-manager cert-manager-cainjector-846b7c9f8c-4sgvn 1/1 Running 16 2d1h
cert-manager cert-manager-fbc979d45-4nqpf 1/1 Running 3 2d1h
cert-manager cert-manager-webhook-67956cb44b-rxwfn 1/1 Running 1 2d1h
istio-system authservice-0 1/1 Running 1 2d1h
istio-system cluster-local-gateway-d8688cfdd-c8zxp 1/1 Running 1 2d1h
istio-system istio-ingressgateway-84f6567479-24v7c 1/1 Running 1 2d1h
istio-system istiod-5d6d848d84-2k44j 1/1 Running 1 2d1h
knative-eventing broker-controller-d675f7d9f-8xlcw 1/1 Running 1 2d1h
knative-eventing eventing-controller-8554597688-9k8vb 1/1 Running 1 2d1h
knative-eventing eventing-webhook-5f7565f8dd-m66bn 1/1 Running 1 2d1h
knative-eventing imc-controller-5fd9999fb-4nn4b 1/1 Running 1 2d1h
knative-eventing imc-dispatcher-fc5555f96-748zp 1/1 Running 1 2d1h
knative-serving activator-66c44ddffc-v98s2 1/1 Running 1 2d1h
knative-serving autoscaler-5c77d67c69-shpvv 1/1 Running 1 2d1h
knative-serving controller-b9bfbfb4f-qmgr6 1/1 Running 1 2d1h
knative-serving istio-webhook-5466fb9cd-kfgtk 1/1 Running 1 2d1h
knative-serving networking-istio-94d878c4b-xsbvc 1/1 Running 1 2d1h
knative-serving webhook-5d96ccb4fc-74g27 1/1 Running 1 2d1h
kube-system coredns-66bff467f8-p8txx 1/1 Running 20 45d
kube-system coredns-66bff467f8-qqrn9 1/1 Running 20 45d
kube-system etcd-master 1/1 Running 4 45d
kube-system kube-apiserver-master 1/1 Running 11565 45d
kube-system kube-controller-manager-master 1/1 Running 33 45d
kube-system kube-flannel-ds-8gb4m 1/1 Running 26 45d
kube-system kube-flannel-ds-tpnlj 1/1 Running 11 45d
kube-system kube-proxy-vrcts 1/1 Running 23 45d
kube-system kube-proxy-w8sv8 1/1 Running 4 45d
kube-system kube-scheduler-master 1/1 Running 34 45d
kubeflow admission-webhook-deployment-8678d7d5fc-w5llp 1/1 Running 1 2d1h
kubeflow cache-deployer-deployment-7cb5846cfb-w2zt4 2/2 Running 1 11m
kubeflow cache-server-7d5679f47f-tp75j 2/2 Running 0 11m
kubeflow centraldashboard-75466989b6-29hkv 1/1 Running 1 2d1h
kubeflow jupyter-web-app-deployment-b9df56ff-nz8xx 1/1 Running 1 2d1h
kubeflow katib-controller-b7b78dcf-v2pmf 1/1 Running 1 2d1h
kubeflow katib-db-manager-755464ffcf-946nd 1/1 Running 1 2d1h
kubeflow katib-mysql-f6b75dd75-5spgj 1/1 Running 1 2d1h
kubeflow katib-ui-7b997fd84f-hmn9d 1/1 Running 1 2d1h
kubeflow kfserving-controller-manager-0 2/2 Running 2 2d1h
kubeflow kubeflow-pipelines-profile-controller-65c8c9dc9c-mktlk 1/1 Running 0 4m22s
kubeflow metacontroller-0 1/1 Running 0 11m
kubeflow metadata-envoy-deployment-5b8555884c-7g4j9 1/1 Running 0 11m
kubeflow metadata-grpc-deployment-844fdd8f45-k5sr7 2/2 Running 5 11m
kubeflow metadata-writer-7b889fb74d-fjzkm 2/2 Running 2 11m
kubeflow minio-6f4c68d54f-tqvgm 2/2 Running 0 5m47s
kubeflow ml-pipeline-84bc5648fc-nz8rd 2/2 Running 4 11m
kubeflow ml-pipeline-persistenceagent-69d8f6d499-tcc6s 2/2 Running 1 11m
kubeflow ml-pipeline-scheduledworkflow-6cb4797f7f-nq2tx 2/2 Running 0 11m
kubeflow ml-pipeline-ui-56cc5c444b-kg7zp 2/2 Running 0 11m
kubeflow ml-pipeline-viewer-crd-67f54547b4-b2gg5 2/2 Running 1 11m
kubeflow ml-pipeline-visualizationserver-7b6ff7bf5f-qsqrs 2/2 Running 0 11m
kubeflow mpi-operator-6cd4967df-pwbdn 1/1 Running 3 2d1h
kubeflow mxnet-operator-65ddbb8bb7-kjh2f 1/1 Running 3 2d1h
kubeflow mysql-79cb69477c-6d7lv 2/2 Running 0 11m
kubeflow notebook-controller-deployment-7fb67c4d4c-sfgmc 1/1 Running 1 2d1h
kubeflow profiles-deployment-6888b86fc8-8v2dv 2/2 Running 2 2d1h
kubeflow pytorch-operator-5ccf6f746d-gt8xd 2/2 Running 5 2d1h
kubeflow tensorboard-controller-controller-manager-85fbc9cb98-rzw4n 3/3 Running 23 2d1h
kubeflow tensorboards-web-app-deployment-75d87f8559-xxvvh 1/1 Running 1 2d1h
kubeflow tf-job-operator-7c79b5b65f-kffrp 1/1 Running 16 2d1h
kubeflow volumes-web-app-deployment-64db74d95d-z2q2b 1/1 Running 1 2d1h
kubeflow workflow-controller-9f444667d-6cgmf 2/2 Running 2 4m22s
kubeflow xgboost-operator-deployment-7d8df579f5-jhx5g 2/2 Running 6 2d1h
local-path-storage local-path-provisioner-7c6fcb5b5f-8cg9f 1/1 Running 16 2d1h
root@master:/home/hqc/Kubeflow/Kubeflow1.3# kubectl -n kubeflow get all
NAME READY STATUS RESTARTS AGE
pod/admission-webhook-deployment-8678d7d5fc-w5llp 1/1 Running 1 2d1h
pod/cache-deployer-deployment-7cb5846cfb-w2zt4 2/2 Running 1 38m
pod/cache-server-7d5679f47f-tp75j 2/2 Running 0 38m
pod/centraldashboard-75466989b6-29hkv 1/1 Running 1 2d1h
pod/jupyter-web-app-deployment-b9df56ff-nz8xx 1/1 Running 1 2d1h
pod/katib-controller-b7b78dcf-v2pmf 1/1 Running 1 2d1h
pod/katib-db-manager-755464ffcf-946nd 1/1 Running 1 2d1h
pod/katib-mysql-f6b75dd75-5spgj 1/1 Running 1 2d1h
pod/katib-ui-7b997fd84f-hmn9d 1/1 Running 1 2d1h
pod/kfserving-controller-manager-0 2/2 Running 2 2d1h
pod/kubeflow-pipelines-profile-controller-65c8c9dc9c-mktlk 1/1 Running 0 31m
pod/metacontroller-0 1/1 Running 0 38m
pod/metadata-envoy-deployment-5b8555884c-7g4j9 1/1 Running 0 38m
pod/metadata-grpc-deployment-844fdd8f45-k5sr7 2/2 Running 5 38m
pod/metadata-writer-7b889fb74d-fjzkm 2/2 Running 2 38m
pod/minio-6f4c68d54f-tqvgm 2/2 Running 0 32m
pod/ml-pipeline-84bc5648fc-nz8rd 2/2 Running 4 38m
pod/ml-pipeline-persistenceagent-69d8f6d499-tcc6s 2/2 Running 1 38m
pod/ml-pipeline-scheduledworkflow-6cb4797f7f-nq2tx 2/2 Running 0 38m
pod/ml-pipeline-ui-56cc5c444b-kg7zp 2/2 Running 0 38m
pod/ml-pipeline-viewer-crd-67f54547b4-b2gg5 2/2 Running 1 38m
pod/ml-pipeline-visualizationserver-7b6ff7bf5f-qsqrs 2/2 Running 0 38m
pod/mpi-operator-6cd4967df-pwbdn 1/1 Running 3 2d1h
pod/mxnet-operator-65ddbb8bb7-kjh2f 1/1 Running 3 2d1h
pod/mysql-79cb69477c-6d7lv 2/2 Running 0 38m
pod/notebook-controller-deployment-7fb67c4d4c-sfgmc 1/1 Running 1 2d1h
pod/profiles-deployment-6888b86fc8-8v2dv 2/2 Running 2 2d1h
pod/pytorch-operator-5ccf6f746d-gt8xd 2/2 Running 5 2d1h
pod/tensorboard-controller-controller-manager-85fbc9cb98-rzw4n 3/3 Running 23 2d1h
pod/tensorboards-web-app-deployment-75d87f8559-xxvvh 1/1 Running 1 2d1h
pod/tf-job-operator-7c79b5b65f-kffrp 1/1 Running 16 2d1h
pod/volumes-web-app-deployment-64db74d95d-z2q2b 1/1 Running 1 2d1h
pod/workflow-controller-9f444667d-6cgmf 2/2 Running 2 31m
pod/xgboost-operator-deployment-7d8df579f5-jhx5g 2/2 Running 6 2d1h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/admission-webhook-service ClusterIP 10.107.240.242 <none> 443/TCP 2d1h
service/cache-server ClusterIP 10.103.205.52 <none> 443/TCP 38m
service/centraldashboard ClusterIP 10.105.105.223 <none> 80/TCP 2d1h
service/jupyter-web-app-service ClusterIP 10.96.92.138 <none> 80/TCP 2d1h
service/katib-controller ClusterIP 10.104.105.114 <none> 443/TCP,8080/TCP 2d1h
service/katib-db-manager ClusterIP 10.104.242.135 <none> 6789/TCP 2d1h
service/katib-mysql ClusterIP 10.104.127.193 <none> 3306/TCP 2d1h
service/katib-ui ClusterIP 10.102.31.52 <none> 80/TCP 2d1h
service/kfserving-controller-manager-metrics-service ClusterIP 10.107.165.77 <none> 8443/TCP 2d1h
service/kfserving-controller-manager-service ClusterIP 10.97.70.9 <none> 443/TCP 2d1h
service/kfserving-webhook-server-service ClusterIP 10.101.18.48 <none> 443/TCP 2d1h
service/kubeflow-pipelines-profile-controller ClusterIP 10.106.136.54 <none> 80/TCP 38m
service/metadata-envoy-service ClusterIP 10.103.159.117 <none> 9090/TCP 38m
service/metadata-grpc-service ClusterIP 10.109.160.91 <none> 8080/TCP 38m
service/minio-service ClusterIP 10.104.196.141 <none> 9000/TCP 38m
service/ml-pipeline ClusterIP 10.105.62.168 <none> 8888/TCP,8887/TCP 38m
service/ml-pipeline-ui ClusterIP 10.103.205.88 <none> 80/TCP 38m
service/ml-pipeline-visualizationserver ClusterIP 10.109.249.129 <none> 8888/TCP 38m
service/mysql ClusterIP 10.103.78.198 <none> 3306/TCP 38m
service/notebook-controller-service ClusterIP 10.109.155.251 <none> 443/TCP 2d1h
service/profiles-kfam ClusterIP 10.111.128.33 <none> 8081/TCP 2d1h
service/pytorch-operator ClusterIP 10.109.153.150 <none> 8443/TCP 2d1h
service/tensorboard-controller-controller-manager-metrics-service ClusterIP 10.101.132.68 <none> 8443/TCP 2d1h
service/tensorboards-web-app-service ClusterIP 10.102.18.212 <none> 80/TCP 2d1h
service/tf-job-operator ClusterIP 10.101.26.17 <none> 8443/TCP 2d1h
service/volumes-web-app-service ClusterIP 10.106.84.128 <none> 80/TCP 2d1h
service/workflow-controller-metrics ClusterIP 10.110.236.185 <none> 9090/TCP 38m
service/xgboost-operator-service ClusterIP 10.109.83.203 <none> 443/TCP 2d1h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/admission-webhook-deployment 1/1 1 1 2d1h
deployment.apps/cache-deployer-deployment 1/1 1 1 38m
deployment.apps/cache-server 1/1 1 1 38m
deployment.apps/centraldashboard 1/1 1 1 2d1h
deployment.apps/jupyter-web-app-deployment 1/1 1 1 2d1h
deployment.apps/katib-controller 1/1 1 1 2d1h
deployment.apps/katib-db-manager 1/1 1 1 2d1h
deployment.apps/katib-mysql 1/1 1 1 2d1h
deployment.apps/katib-ui 1/1 1 1 2d1h
deployment.apps/kubeflow-pipelines-profile-controller 1/1 1 1 31m
deployment.apps/metadata-envoy-deployment 1/1 1 1 38m
deployment.apps/metadata-grpc-deployment 1/1 1 1 38m
deployment.apps/metadata-writer 1/1 1 1 38m
deployment.apps/minio 1/1 1 1 32m
deployment.apps/ml-pipeline 1/1 1 1 38m
deployment.apps/ml-pipeline-persistenceagent 1/1 1 1 38m
deployment.apps/ml-pipeline-scheduledworkflow 1/1 1 1 38m
deployment.apps/ml-pipeline-ui 1/1 1 1 38m
deployment.apps/ml-pipeline-viewer-crd 1/1 1 1 38m
deployment.apps/ml-pipeline-visualizationserver 1/1 1 1 38m
deployment.apps/mpi-operator 1/1 1 1 2d1h
deployment.apps/mxnet-operator 1/1 1 1 2d1h
deployment.apps/mysql 1/1 1 1 38m
deployment.apps/notebook-controller-deployment 1/1 1 1 2d1h
deployment.apps/profiles-deployment 1/1 1 1 2d1h
deployment.apps/pytorch-operator 1/1 1 1 2d1h
deployment.apps/tensorboard-controller-controller-manager 1/1 1 1 2d1h
deployment.apps/tensorboards-web-app-deployment 1/1 1 1 2d1h
deployment.apps/tf-job-operator 1/1 1 1 2d1h
deployment.apps/volumes-web-app-deployment 1/1 1 1 2d1h
deployment.apps/workflow-controller 1/1 1 1 31m
deployment.apps/xgboost-operator-deployment 1/1 1 1 2d1h
NAME DESIRED CURRENT READY AGE
replicaset.apps/admission-webhook-deployment-8678d7d5fc 1 1 1 2d1h
replicaset.apps/cache-deployer-deployment-7cb5846cfb 1 1 1 38m
replicaset.apps/cache-server-7d5679f47f 1 1 1 38m
replicaset.apps/centraldashboard-75466989b6 1 1 1 2d1h
replicaset.apps/jupyter-web-app-deployment-b9df56ff 1 1 1 2d1h
replicaset.apps/katib-controller-b7b78dcf 1 1 1 2d1h
replicaset.apps/katib-db-manager-755464ffcf 1 1 1 2d1h
replicaset.apps/katib-mysql-f6b75dd75 1 1 1 2d1h
replicaset.apps/katib-ui-7b997fd84f 1 1 1 2d1h
replicaset.apps/kubeflow-pipelines-profile-controller-65c8c9dc9c 1 1 1 31m
replicaset.apps/metadata-envoy-deployment-5b8555884c 1 1 1 38m
replicaset.apps/metadata-grpc-deployment-844fdd8f45 1 1 1 38m
replicaset.apps/metadata-writer-7b889fb74d 1 1 1 38m
replicaset.apps/minio-6f4c68d54f 1 1 1 32m
replicaset.apps/ml-pipeline-84bc5648fc 1 1 1 38m
replicaset.apps/ml-pipeline-persistenceagent-69d8f6d499 1 1 1 38m
replicaset.apps/ml-pipeline-scheduledworkflow-6cb4797f7f 1 1 1 38m
replicaset.apps/ml-pipeline-ui-56cc5c444b 1 1 1 38m
replicaset.apps/ml-pipeline-viewer-crd-67f54547b4 1 1 1 38m
replicaset.apps/ml-pipeline-visualizationserver-7b6ff7bf5f 1 1 1 38m
replicaset.apps/mpi-operator-6cd4967df 1 1 1 2d1h
replicaset.apps/mxnet-operator-65ddbb8bb7 1 1 1 2d1h
replicaset.apps/mysql-79cb69477c 1 1 1 38m
replicaset.apps/notebook-controller-deployment-7fb67c4d4c 1 1 1 2d1h
replicaset.apps/profiles-deployment-6888b86fc8 1 1 1 2d1h
replicaset.apps/pytorch-operator-5ccf6f746d 1 1 1 2d1h
replicaset.apps/tensorboard-controller-controller-manager-85fbc9cb98 1 1 1 2d1h
replicaset.apps/tensorboards-web-app-deployment-75d87f8559 1 1 1 2d1h
replicaset.apps/tf-job-operator-7c79b5b65f 1 1 1 2d1h
replicaset.apps/volumes-web-app-deployment-64db74d95d 1 1 1 2d1h
replicaset.apps/workflow-controller-9f444667d 1 1 1 31m
replicaset.apps/xgboost-operator-deployment-7d8df579f5 1 1 1 2d1h
NAME READY AGE
statefulset.apps/kfserving-controller-manager 1/1 2d1h
statefulset.apps/metacontroller 1/1 38m
Everything is Running and ready!
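With every pod healthy, the Kubeflow central dashboard can be reached through the Istio ingress gateway; a minimal sketch using a local port-forward (8080 is an arbitrary local port):
# Forward the ingress gateway to localhost, then browse to http://localhost:8080
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80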