RayJob problem 4: after submitting a RayJob, it stays stuck at "waiting for the cluster to be ready"

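When submitting a RayJob with kuberay-operator 0.4.0, the operator log kept reporting that it was waiting for the cluster to be ready. After consulting the related issue (#1002), two fixes are offered: restart the operator Pod, or upgrade the operator to 0.5.1. The upgrade resolved the job submission problem.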

Submitting a RayJob with KubeRay

0. Background

This setup uses kuberay-operator version 0.4.0.

1. Problem

Submit the job:

 kubectl apply -f ray_v1alpha1_rayjob.yaml

The operator log then repeats the same messages every 3 seconds and the job never starts:

2023-05-10T12:24:20.131Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:20.131Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:23.132Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:24:23.132Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:23.132Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:26.133Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:24:26.133Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:26.133Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:29.134Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:24:29.134Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:29.135Z        INFO    controllers.RayJob      waiting for the cluster to be ready     {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
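
To confirm that the operator is looping rather than just slow, it can help to count how often the "waiting" message appears for one cluster. The following is a minimal sketch; `count_waiting` is a hypothetical helper name, and it only assumes the log format shown above.

```shell
# Count "waiting for the cluster to be ready" log lines for one RayCluster.
# Reads operator log text from stdin and prints the number of matching lines.
# count_waiting is an illustrative helper, not part of KubeRay.
count_waiting() {
  local cluster="$1"
  # First keep only the "waiting" lines, then count those naming the cluster.
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
  grep -F 'waiting for the cluster to be ready' |
    grep -cF "\"rayCluster\": \"${cluster}\"" || true
}
```

Assuming the operator runs as a Deployment named `kuberay-operator` (this name is an assumption and may differ in your cluster), the log could be piped in like this: `kubectl logs deployment/kuberay-operator | count_waiting rayjob-sample-raycluster-9jtn6`. A count that keeps growing while the job never starts matches the symptom described here.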

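2. Analysis

This is a known issue in kuberay-operator: https://github.com/ray-project/kuberay/issues/1002

3. Solutions

Option 1:

Restart the operator pod:
Log in to the CCE console and redeploy the kuberay-operator instance; the job is then submitted and runs.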
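
Where console access is not convenient, the same restart can probably be done from the command line. This is a minimal sketch, assuming the operator runs as a Deployment named `kuberay-operator`; the function name, the default namespace, and the 120-second timeout are illustrative choices, not taken from the original setup.

```shell
# Restart the KubeRay operator and wait for the new pod to become ready.
# Assumptions: the operator is a Deployment called "kuberay-operator";
# the namespace defaults to "default". Adjust both to your installation.
restart_kuberay_operator() {
  local namespace="${1:-default}"
  local deployment="${2:-kuberay-operator}"
  # Trigger a rolling restart, then block until the rollout finishes or times out.
  kubectl -n "$namespace" rollout restart "deployment/${deployment}" &&
    kubectl -n "$namespace" rollout status "deployment/${deployment}" --timeout=120s
}
```

For example, `restart_kuberay_operator default` restarts the operator in the `default` namespace. A restart only works around the problem, so the stuck state may come back with later jobs on 0.4.0.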

Option 2:

Upgrade kuberay-operator to 0.5.1.
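
If the operator was installed with Helm, the upgrade might look like the sketch below. The release name `kuberay-operator`, the repo alias `kuberay`, and the function name are assumptions; the original post does not say how the operator was installed, so treat this as one possible route rather than the method actually used.

```shell
# Upgrade a Helm-installed KubeRay operator to a given chart version.
# Assumptions: Helm 3, a release named "kuberay-operator", and the
# KubeRay Helm chart repository registered under the alias "kuberay".
upgrade_kuberay_operator() {
  local version="${1:-0.5.1}"
  local release="${2:-kuberay-operator}"
  helm repo add kuberay https://ray-project.github.io/kuberay-helm/ &&
    helm repo update &&
    helm upgrade "$release" kuberay/kuberay-operator --version "$version"
}
```

Calling `upgrade_kuberay_operator 0.5.1` targets the version mentioned above. Helm generally does not update CRDs shipped in a chart during `helm upgrade`, so the CRDs may need separate handling; the KubeRay release notes are the place to check.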


After the upgrade, submitting the job again succeeded:

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob# kubectl apply -f ray_v1alpha1_rayjob.yaml
rayjob.ray.io/rayjob-sample created
configmap/ray-job-code-sample created
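
The `created` lines only show that the resources were accepted, not that the job ran. A small check like the following can help; it is a sketch that assumes the RayJob status exposes `jobDeploymentStatus` and `jobStatus` fields, and `rayjob_status` is a hypothetical helper name.

```shell
# Print the deployment status and job status of a RayJob on one line.
# Assumption: the RayJob CRD reports .status.jobDeploymentStatus and
# .status.jobStatus; verify the field names against your operator version.
rayjob_status() {
  local name="$1"
  local namespace="${2:-default}"
  kubectl -n "$namespace" get rayjob "$name" \
    -o jsonpath='{.status.jobDeploymentStatus}{" "}{.status.jobStatus}{"\n"}'
}
```

For the job in this post the call would be `rayjob_status rayjob-sample default`. If the fields come back empty, `kubectl describe rayjob rayjob-sample` shows the full status and recent events.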