I. Setting up K8s
The current AIperf cluster runs Kubernetes 1.21.
II. Installing Kubeflow
The installation mainly follows:
GitHub - kubeflow/manifests at v1.5.1
For installs from within China:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
It is recommended to run kustomize build example > all.yaml to export the manifests that will actually be applied and inspect them first, and to read through the potential problems listed below before applying.
Check that all pods are Running:
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
# this namespace is just an example, not important
kubectl get pods -n kubeflow-user-example-com
1. Image pull problems
Error: ImagePullBackOff
Kubeflow defaults to images hosted on Google's registry, which need to be replaced with mirrors reachable from China.
Reference: kubeflow安装_kubeflow 安装-CSDN博客
Fix: prepend m.daocloud.io/ to the beginning of each image name.
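The prefix edit can be scripted over the exported all.yaml; a minimal sketch, where the two image lines are made-up stand-ins for the real manifest:

```shell
#!/bin/sh
# Stand-in for the real output of `kustomize build example > all.yaml`;
# only the image: lines matter for this edit.
cat > all.yaml <<'EOF'
        image: gcr.io/arrikto/kubeflow/oidc-authservice:6ac9400
        image: quay.io/jetstack/cert-manager-controller:v1.5.0
EOF
# Prepend the m.daocloud.io mirror to images hosted on registries that are
# hard to reach from inside China.
sed -E -i 's#image: (gcr\.io|quay\.io|ghcr\.io)#image: m.daocloud.io/\1#' all.yaml
cat all.yaml
```

After the rewrite, apply all.yaml instead of piping kustomize output directly, so the mirrored images are what actually gets pulled.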
2. PV/PVC problems
Error:
Warning FailedScheduling 16h default-scheduler 0/33 nodes are available: 33 pod has unbound immediate PersistentVolumeClaims.
For PV/PVC concepts, see: k8s的持久化存储 PersistentVolume · 音视频/C++/k8s/Docker等等 学习笔记 · 看云
If a storage server has already been added to the cluster and the corresponding provisioning is in place, then for this Kubeflow install it is enough to set the corresponding StorageClass as the default and recreate the PVCs so they bind automatically (to recreate a PVC, delete its yaml first, then re-apply it).
kubectl -n kubeflow get pvc
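Marking the StorageClass as default can be done with a strategic merge patch; a sketch, assuming a StorageClass named longhorn (substitute your own name):

```shell
#!/bin/sh
# Write out the annotation patch that marks a StorageClass as the cluster
# default, so unbound PVCs created by the kubeflow manifests can bind to it.
cat > sc-default-patch.json <<'EOF'
{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}
EOF
# The following need a live cluster; shown as comments for review:
#   kubectl patch storageclass longhorn --patch-file sc-default-patch.json
#   kubectl -n kubeflow delete -f <pvc yaml> && kubectl -n kubeflow apply -f <pvc yaml>
cat sc-default-patch.json
```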
3. Nodes missing the nvidia-container-runtime configuration
Error:
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "mt-broker-ingress-5d6c85f56b-w72p8": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/b6ef6b76119d8d6497d1cda3cb73d81c0af6c31dbb6beceabce1298824a9c218/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown
Because the master nodes have no GPU cards, the affected pods must be scheduled onto GPU machines. To keep them off the master, label it as follows:
kubectl label nodes <master_node> nvidia.com/gpu.deploy.operands=false
4. Pods that depend on each other
The ml-pipeline container may report:
Failed to check if Minio bucket exists. Error: 503 Service Unavailable
Experimentation showed that the minio and ml-pipeline containers need to run on the same machine.
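One way to force the two onto the same machine is a shared nodeSelector; a sketch (the mlpipeline=true label matches the one used for authservice later in these notes, and the patch itself is an assumption, not taken from the upstream manifests):

```shell
#!/bin/sh
# Patch fragment that pins a deployment's pods to nodes carrying the
# label mlpipeline=true.
cat > colocate-patch.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        mlpipeline: "true"
EOF
# With a cluster available, apply the same patch to both deployments:
#   kubectl -n kubeflow patch deploy minio       --patch-file colocate-patch.yaml
#   kubectl -n kubeflow patch deploy ml-pipeline --patch-file colocate-patch.yaml
cat colocate-patch.yaml
```

Label exactly one node with mlpipeline=true first, so both pods land on it.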
5. Errors in auth and related components
Error:
Error opening bolt store: open /var/lib/authservice/data.db: permission denied
Fix: see kubeflow安装, Error opening bolt store: open /var/lib/authservice/data.db: permission denied-CSDN博客
Another error involved the authservice port; edit the ConfigMap:
kubectl -n istio-system edit cm oidc-authservice-parameters
The cause was an extra pair of quotes around 8080 in the parameter settings. When debugging, you can start the container manually on the machine to check whether the port is configured correctly.
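A quick way to spot the doubled quotes without starting containers is to grep the exported ConfigMap; a sketch, where cm.yaml and the PORT key are stand-ins for the real oidc-authservice-parameters content:

```shell
#!/bin/sh
# Stand-in for `kubectl -n istio-system get cm oidc-authservice-parameters -o yaml`.
# The value '"8080"' means the stored string contains literal quote
# characters -- exactly the bug described above; it should be plain 8080.
cat > cm.yaml <<'EOF'
data:
  PORT: '"8080"'
EOF
if grep -q '"8080"' cm.yaml; then
  echo "PORT is double-quoted: remove the inner quotes"
fi
```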
参考:Unable to view pipelines UI · Issue #1931 · kubeflow/manifests · GitHub
kubectl edit destinationrule -n kubeflow ml-pipeline
kubectl edit destinationrule -n kubeflow ml-pipeline-ui
Then edit the spec.trafficPolicy.tls.mode field, changing its value from ISTIO_MUTUAL to DISABLE. After that, the pipelines section was accessible.
6. RBAC access denied
After the changes above, the runs/pipelines pages still returned: RBAC: access denied. Applying an allow-all AuthorizationPolicy resolved it:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-all
  namespace: istio-system
spec:
  rules:
  - {}
参考:[Kubeflow] RBAC: access denied / 2023.06.21
7. CSRF cookie problems during use
Error: Could not find CSRF cookie XSRF-TOKEN in the request
Fixes:
1. Follow edited app secure cookies by BenzhaminKim · Pull Request #2155 · kubeflow/manifests · GitHub and modify the relevant yamls to add the configuration.
2. Add APP_SECURE_COOKIES=false in contrib/kserve/models-web-app/overlays/kubeflow/kustomization.yaml.
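The kustomization edit might look like the fragment below; the generator name kserve-models-web-app-config is an assumption, so merge the literal into whatever configMapGenerator the real kustomization.yaml already declares rather than copying this blindly:

```shell
#!/bin/sh
# Hypothetical kustomization fragment adding APP_SECURE_COOKIES=false
# to the models-web-app ConfigMap.
cat > kustomization-fragment.yaml <<'EOF'
configMapGenerator:
- name: kserve-models-web-app-config
  behavior: merge
  literals:
  - APP_SECURE_COOKIES=false
EOF
cat kustomization-fragment.yaml
```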
参考:Could not find CSRF cookie XSRF-TOKEN in the request · Issue #2225 · kubeflow/manifests · GitHub
8. training-operator CrashLoopBackOff
Inspection shows the pod's exit reason is: OOMKilled
Fix: raise the memory request in the pod's resource settings.
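The memory bump can be applied as a patch; a sketch, with placeholder sizes (512Mi/1Gi) that should be chosen from the OOMKilled pod's actual usage:

```shell
#!/bin/sh
# Patch fragment raising the memory request/limit for the
# training-operator container.
cat > mem-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: training-operator
        resources:
          requests:
            memory: 512Mi
          limits:
            memory: 1Gi
EOF
# With a cluster available:
#   kubectl -n kubeflow patch deploy training-operator --patch-file mem-patch.yaml
cat mem-patch.yaml
```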
9、upstream connect error or disconnect/reset before headers
After re-applying to a new cluster, this error appeared: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER
Attempted fix:
参考:Unable to view pipelines UI · Issue #1931 · kubeflow/manifests · GitHub
Change ISTIO_MUTUAL to DISABLE:
kubectl edit destinationrule -n kubeflow ml-pipeline-ui
kubectl edit destinationrule -n kubeflow ml-pipeline
Error:
Attempted fix: disable istio sidecar injection for minio and mysql:
kubectl annotate deploy minio -n kubeflow sidecar.istio.io/inject=false
kubectl annotate deploy mysql sidecar.istio.io/inject=false -n kubeflow
10. Useful techniques
Editing the actually-applied yaml after apply:
# inspect the pod to find the deployment that owns it
kubectl describe po mysql-896768bbd-nghjz -n kubeflow
# locate the resources
kubectl -n kubeflow get all
# edit
kubectl -n kubeflow edit deploy mysql
This is mainly for making temporary configuration changes take effect, but some deployments ended up with two replicas after editing, with two corresponding pods started, so it is not recommended. Instead, find the yaml for the pod, delete it, add a nodeSelector so the node the pod runs on is under control, then re-apply. For example:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: authservice
  namespace: istio-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: authservice
  serviceName: authservice
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        app: authservice
    spec:
      nodeSelector:
        mlpipeline: "true"
      initContainers:
      - name: fix-permission
        image: busybox
        command: ['sh', '-c']
        args: ['chmod -R 777 /var/lib/authservice;']
        volumeMounts:
        - mountPath: /var/lib/authservice
          name: data
      containers:
      - envFrom:
        - secretRef:
            name: oidc-authservice-client
        - configMapRef:
            name: oidc-authservice-parameters
        image: m.daocloud.io/gcr.io/arrikto/kubeflow/oidc-authservice:6ac9400
        imagePullPolicy: Always
        name: authservice
        ports:
        - containerPort: 8080
          name: http-api
        readinessProbe:
          httpGet:
            path: /
            port: 8081
        volumeMounts:
        - mountPath: /var/lib/authservice
          name: data
      securityContext:
        fsGroup: 111
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: authservice-pvc
# force-delete a stuck pod
kubectl delete pod cert-manager-webhook-fcd445bc4-xxzgs -n cert-manager --force --grace-period=0
# clear the finalizers of a namespace stuck in Terminating
kubectl get ns knative-serving -o json > ns.json
vim ns.json  # remove the entries under metadata.finalizers; or: kubectl patch namespace knative-serving -p '{"metadata":{"finalizers":[]}}'
kubectl replace --raw "/api/v1/namespaces/knative-serving/finalize" -f ./ns.json
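The vim step can also be scripted; a sketch, assuming python3 on the machine (the heredoc JSON stands in for the real `kubectl get ns ... -o json` export):

```shell
#!/bin/sh
# Stand-in for the exported namespace object with a stuck finalizer.
cat > ns.json <<'EOF'
{"metadata": {"name": "knative-serving", "finalizers": ["kubernetes"]}}
EOF
# Empty the finalizers list in place.
python3 - <<'EOF'
import json
with open("ns.json") as f:
    ns = json.load(f)
ns["metadata"]["finalizers"] = []
with open("ns.json", "w") as f:
    json.dump(ns, f)
EOF
cat ns.json
```

The edited ns.json is then fed to the `kubectl replace --raw .../finalize` call shown above.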
11. Other related components
Longhorn:
Longhorn is a lightweight, reliable, and easy-to-use distributed block storage system for Kubernetes.
Longhorn - Cloud native distributed block storage for Kubernetes | Rancher
docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher --add-local=true
After the container starts, log in to the web UI to check it.
Importing an existing cluster: Rancher导入原生Kubernetes集群_rancher导入自建集群-CSDN博客
Installing Longhorn:
Problem:
time="2024-01-03T08:22:40Z" level=fatal msg="Error starting manager: Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host: failed to execute: nsenter [--mount=/host/proc/94156/ns/mnt --net=/host/proc/94156/ns/net iscsiadm --version], output , stderr nsenter: failed to execute iscsiadm: No such file or directory\n: exit status 127" func=app.DaemonCmd.func1 file="daemon.go:83"
Install iscsi-initiator-utils on the affected node:
Attach failures may also occur; see [BUG] Configuration file `/etc/iscsi/initiatorname.iscsi` does not exist iscsi using `longhorn-iscsi-installation.yaml`. · Issue #2319 · longhorn/longhorn · GitHub
yum install iscsi-initiator-utils
sudo systemctl enable iscsid
sudo systemctl start iscsid