Submitting a RayJob with KubeRay
0. Background
This note is based on kuberay-operator version 0.4.0.
1. Problem
Submit the job:
kubectl apply -f ray_v1alpha1_rayjob.yaml
Error output (the operator log keeps repeating that it is waiting for the cluster to be ready):
2023-05-10T12:24:20.131Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:20.131Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:23.132Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:24:23.132Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:23.132Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:26.133Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:24:26.133Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:26.133Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:29.134Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:24:29.134Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:24:29.135Z INFO controllers.RayJob waiting for the cluster to be ready {"rayCluster": "rayjob-sample-raycluster-9jtn6"}
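To confirm this stuck state from the command line, one option is to count how often the operator logs the waiting message. The following is a minimal sketch, assuming the operator log text is piped in on stdin; the function name `count_waiting` is hypothetical and not part of KubeRay.

```shell
#!/bin/sh
# Count log lines in which the RayJob controller is still waiting for the cluster.
# Reads operator log text from stdin and prints the number of matching lines.
# (count_waiting is a hypothetical helper name, not a KubeRay command.)
count_waiting() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the exit status 0
    grep -c 'waiting for the cluster to be ready' || true
}
```

A possible use, assuming the operator runs as a Deployment named kuberay-operator in the default namespace, would be `kubectl logs deployment/kuberay-operator -n default | count_waiting`. A count that keeps growing every few seconds matches the loop shown in the log above.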
2. Analysis
This is a known issue in kuberay-operator: https://github.com/ray-project/kuberay/issues/1002
3. Solution
Option 1:
Restart the operator pod:
Log in to the CCE console and redeploy the kuberay-operator instance; the job is then submitted and runs.
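The same restart can probably be done without the CCE console. The sketch below assumes the operator runs as a Deployment named kuberay-operator in the default namespace; both values are assumptions, so adjust them to your installation. The function name `restart_operator` is hypothetical.

```shell
#!/bin/sh
# Restart the operator by rolling its Deployment; Kubernetes then recreates the pod.
# $1: Deployment name (assumed default: kuberay-operator)
# $2: namespace       (assumed default: default)
restart_operator() {
    kubectl rollout restart deployment "${1:-kuberay-operator}" -n "${2:-default}"
}
```

Calling `restart_operator` with no arguments uses the assumed defaults; `restart_operator my-operator ray-system` targets a different Deployment and namespace.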

Option 2:
Upgrade kuberay-operator to 0.5.1.
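The original note does not say how the upgrade was performed. If the operator was installed with Helm from the KubeRay chart repository, the upgrade might look like the sketch below. The release name, the namespace, and the availability of chart version 0.5.1 in that repository are all assumptions; `upgrade_operator` is a hypothetical helper name.

```shell
#!/bin/sh
# Upgrade an existing Helm release of the operator to chart version 0.5.1.
# $1: Helm release name (assumed default: kuberay-operator)
# $2: namespace         (assumed default: default)
upgrade_operator() {
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/ &&
    helm repo update &&
    helm upgrade "${1:-kuberay-operator}" kuberay/kuberay-operator \
        --version 0.5.1 -n "${2:-default}"
}
```

Helm does not update CRDs shipped in a chart's `crds/` directory during `helm upgrade`, so the KubeRay CRDs may need to be updated separately; check the KubeRay upgrade notes for the version you are moving to.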

After the upgrade, submitting the job again succeeded:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob# kubectl apply -f ray_v1alpha1_rayjob.yaml
rayjob.ray.io/rayjob-sample created
configmap/ray-job-code-sample created
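Summary: submitting a RayJob with kuberay-operator 0.4.0 failed, with the operator log showing that it kept waiting for the cluster to become ready. Based on the related issue (#1002), there are two fixes: restart the operator pod, or upgrade the operator to 0.5.1. The latter resolved the job submission problem.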



