Kubeflow in Practice (v1.5.1)

I. K8s Setup

The current AIperf cluster runs Kubernetes v1.21.

II. Installing Kubeflow

The installation mainly follows:

GitHub - kubeflow/manifests at v1.5.1

A Chinese-language walkthrough for reference:

kubeflow安装_kubeflow 安装-CSDN博客

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

It is recommended to export the YAML that will actually be applied with kustomize build example > all.yaml and review it first; also read through the potential issues listed below before applying.

Check that every pod is in the Running state:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth

kubectl get pods -n knative-eventing

kubectl get pods -n knative-serving
kubectl get pods -n kubeflow

# This is just the bundled example namespace; not critical
kubectl get pods -n kubeflow-user-example-com

1. Image pull issues

Error observed: ImagePullBackOff

Kubeflow pulls its images from Google's registry by default, so they must be switched to a mirror reachable from mainland China.

Reference: kubeflow安装_kubeflow 安装-CSDN博客

Fix the images by prepending m.daocloud.io/ to each image name.
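The prefixing can be scripted over the exported all.yaml. A sketch, assuming the manifest was exported with kustomize build example > all.yaml; the sample image name below is purely illustrative, and only gcr.io/quay.io are rewritten here:

```shell
# Demonstrated on a one-line sample manifest; in practice all.yaml comes
# from `kustomize build example > all.yaml`.
printf 'image: gcr.io/ml-pipeline/api-server:2.0.0\n' > all.yaml

# Prepend the m.daocloud.io mirror prefix to gcr.io / quay.io references.
sed -i \
  -e 's#image: gcr.io/#image: m.daocloud.io/gcr.io/#g' \
  -e 's#image: quay.io/#image: m.daocloud.io/quay.io/#g' \
  all.yaml

cat all.yaml   # → image: m.daocloud.io/gcr.io/ml-pipeline/api-server:2.0.0
```

After rewriting, apply the edited file with kubectl apply -f all.yaml instead of piping kustomize output directly.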

2. PV/PVC issues

Error observed:

Warning FailedScheduling 16h default-scheduler 0/33 nodes are available: 33 pod has unbound immediate PersistentVolumeClaims.

An introduction to the PV/PVC concepts: k8s的持久化存储 PersistentVolume · 音视频/C++/k8s/Docker等等 学习笔记 · 看云

If a storage server has already been added to the cluster with its provisioner deployed, then for this Kubeflow installation you only need to set the corresponding StorageClass as the default and recreate the PVCs so they bind automatically (to recreate a PVC, first delete it via its YAML, then re-apply).
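Marking a StorageClass as the default is just an annotation on the StorageClass object. A sketch, assuming a Longhorn-backed class named longhorn; substitute whatever name your provisioner actually created:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn        # hypothetical name; use your real StorageClass
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
```

Equivalently, patch an existing class in place: kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'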

kubectl -n kubeflow get pvc

3. Nodes missing the nvidia-container-runtime configuration

Error observed:

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "mt-broker-ingress-5d6c85f56b-w72p8": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/b6ef6b76119d8d6497d1cda3cb73d81c0af6c31dbb6beceabce1298824a9c218/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown

The master node has no GPU card, so the affected pods must be scheduled onto GPU machines. To keep them off the master, label it as follows:

kubectl label nodes <master_node> nvidia.com/gpu.deploy.operands=false

4. Interdependencies between pods

The ml-pipeline container may report:

Failed to check if Minio bucket exists. Error: 503 Service Unavailable

Experimentation showed that the minio container and the ml-pipeline container need to run on the same machine.
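One way to enforce the co-location is pod affinity on the minio Deployment toward the pipeline pods. A strategic-merge-patch sketch; the label app: ml-pipeline is an assumption and must match whatever labels the pipeline pods actually carry in your manifests:

```yaml
# minio-affinity.yaml: schedule minio onto the node already running
# an ml-pipeline pod (hypothetical pod label "app: ml-pipeline").
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: ml-pipeline
            topologyKey: kubernetes.io/hostname
```

Apply with kubectl -n kubeflow patch deploy minio --patch-file minio-affinity.yaml; a shared nodeSelector on both deployments works equally well.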

5. Errors in the auth and related components

Error observed:

Error opening bolt store: open /var/lib/authservice/data.db: permission denied

Solution:

kubeflow安装, Error opening bolt store: open /var/lib/authservice/data.db: permission denied-CSDN博客

Another problem required editing the authservice parameters:

kubectl -n istio-system edit cm oidc-authservice-parameters

The cause was an extra pair of quotes around the 8080 port value in the parameter settings. When debugging, you can start the container manually on the machine to check whether the port is configured correctly.
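For illustration, the broken vs. fixed entry looked roughly like this; the key name PORT is assumed from the oidc-authservice-parameters ConfigMap in the v1.5 manifests, and only the relevant key is shown:

```yaml
# oidc-authservice-parameters ConfigMap (sketch).
# Broken: the inner quotes become part of the value the container reads:
#   PORT: '"8080"'
# Fixed:
data:
  PORT: "8080"
```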

Reference: Unable to view pipelines UI · Issue #1931 · kubeflow/manifests · GitHub

kubectl edit destinationrule -n kubeflow ml-pipeline
kubectl edit destinationrule -n kubeflow ml-pipeline-ui
Then edit the spec.trafficPolicy.tls.mode field in each, changing its value from ISTIO_MUTUAL to DISABLE. After that, the pipelines section can be visited.

After the change, the runs/pipelines pages still return the error "RBAC: access denied". The following AuthorizationPolicy works around it:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-all
  namespace: istio-system
spec:
  rules:
  - {}

Reference: [Kubeflow] RBAC: access denied / 2023.06.21

7. CSRF cookie problems during use

Error observed: Could not find CSRF cookie XSRF-TOKEN in the request

Solutions:

1. Modify the relevant YAML to add the configuration, following edited app secure cookies by BenzhaminKim · Pull Request #2155 · kubeflow/manifests · GitHub.

2. Add APP_SECURE_COOKIES=false in contrib/kserve/models-web-app/overlays/kubeflow/kustomization.yaml.

Reference: Could not find CSRF cookie XSRF-TOKEN in the request · Issue #2225 · kubeflow/manifests · GitHub

8. training-operator enters CrashLoopBackOff

Inspection showed the pod's exit reason was: OOMKilled

Raising the memory request in the corresponding resource spec fixes it.
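The bump is the usual resources tweak on the training-operator container. A sketch with illustrative numbers only; tune them to your workload:

```yaml
# training-operator container resources (values are illustrative)
resources:
  requests:
    cpu: 100m
    memory: 512Mi    # raised to avoid OOMKilled
  limits:
    memory: 1Gi
```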

9. upstream connect error or disconnect/reset before headers

After re-applying to a new cluster, the following error appeared: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER

Attempted fix:

Reference: Unable to view pipelines UI · Issue #1931 · kubeflow/manifests · GitHub

Change ISTIO_MUTUAL to DISABLE:

kubectl edit destinationrule -n kubeflow ml-pipeline-ui

kubectl edit destinationrule -n kubeflow ml-pipeline

Another error appeared; attempted fix:

kubectl annotate deploy minio -n kubeflow sidecar.istio.io/inject=false

kubectl annotate deploy mysql sidecar.istio.io/inject=false -n kubeflow
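Note that kubectl annotate deploy puts the annotation on the Deployment object itself, while the Istio injector reads it from the pod template, so patching the template is the more reliable form. A merge-patch sketch, applied with kubectl -n kubeflow patch deploy minio --patch-file no-sidecar.yaml (and likewise for mysql):

```yaml
# no-sidecar.yaml: disable Istio sidecar injection for this Deployment's pods
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
```

Existing pods keep their sidecar until recreated, e.g. via kubectl -n kubeflow rollout restart deploy minio.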

10. Some useful techniques

To edit the YAML that was actually applied, after apply:

# Inspect the pod to find which Deployment owns it
kubectl describe po mysql-896768bbd-nghjz -n kubeflow
# Locate the resource
kubectl -n kubeflow get all
# Edit it in place
kubectl -n kubeflow edit deploy mysql

This approach is mainly for making a temporary configuration change take effect, but after editing, some deployments ended up with two replicas and two corresponding pods started, so it is not recommended. Instead, find the pod's YAML, delete it, add a selector so the node the pod runs on is controllable, and then re-apply. For example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: authservice
  namespace: istio-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: authservice
  serviceName: authservice
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        app: authservice
    spec:
      nodeSelector:
        mlpipeline: "true"
      initContainers:
      - name: fix-permission
        image: busybox
        command: ['sh', '-c']
        args: ['chmod -R 777 /var/lib/authservice;']
        volumeMounts:
        - mountPath: /var/lib/authservice
          name: data
      containers:
      - envFrom:
        - secretRef:
            name: oidc-authservice-client
        - configMapRef:
            name: oidc-authservice-parameters
        image: m.daocloud.io/gcr.io/arrikto/kubeflow/oidc-authservice:6ac9400
        imagePullPolicy: Always
        name: authservice
        ports:
        - containerPort: 8080
          name: http-api
        readinessProbe:
          httpGet:
            path: /
            port: 8081
        volumeMounts:
        - mountPath: /var/lib/authservice
          name: data
      securityContext:
        fsGroup: 111
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: authservice-pvc
Force-delete a stuck pod:

kubectl delete pod cert-manager-webhook-fcd445bc4-xxzgs -n cert-manager --force --grace-period=0

Remove the finalizers of a namespace stuck in Terminating:

kubectl get ns knative-serving -o json > ns.json
vim ns.json  # remove the finalizers, or: kubectl patch namespace knative-serving -p '{"metadata":{"finalizers":[]}}'
kubectl replace --raw "/api/v1/namespaces/knative-serving/finalize" -f ./ns.json

11. Other related components

Longhorn:

Longhorn is a lightweight, reliable, and easy-to-use distributed block storage system for Kubernetes.

Longhorn - Kubernetes 的云原生分布式块存储 | Rancher

docker run --privileged -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher --add-local=true

After the container starts, log in to the web UI to inspect it.

Importing an existing cluster: Rancher导入原生Kubernetes集群_rancher导入自建集群-CSDN博客

Install Longhorn

Problem encountered:

time="2024-01-03T08:22:40Z" level=fatal msg="Error starting manager: Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host: failed to execute: nsenter [--mount=/host/proc/94156/ns/mnt --net=/host/proc/94156/ns/net iscsiadm --version], output , stderr nsenter: failed to execute iscsiadm: No such file or directory\n: exit status 127" func=app.DaemonCmd.func1 file="daemon.go:83"

Install iscsi-initiator-utils on the corresponding node:

Volume attach failures may also occur; see [BUG] Configuration file `/etc/iscsi/initiatorname.iscsi` does not exist iscsi using `longhorn-iscsi-installation.yaml`. · Issue #2319 · longhorn/longhorn · GitHub

yum install iscsi-initiator-utils
sudo systemctl enable iscsid
sudo systemctl start iscsid
