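1. Preparation
Concepts: there are at least three ways to run a Ray job:
- Option 1: start a Ray cluster first, then submit jobs to the running cluster: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/cli.html#
- Option 2: deploy kuberay-operator, which installs the RayJob custom resource definition (CRD) in Kubernetes, then submit a RayJob custom resource: https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayjob.md
- Option 3: integrate Ray with Volcano (using queues and PodGroups): https://github.com/ray-project/kuberay/blob/master/docs/guidance/volcano-integration.md
This post walks through option 2.
Prerequisites:
- A Kubernetes cluster. This post runs on Huawei Cloud CCE, where Kubernetes is already provisioned and Helm is supported.
- kubectl and helm installed locally.
Plan:
- Install the kuberay-operator 0.5.1 Helm chart, with image version v0.5.0.
- Install the kuberay-apiserver 0.5.1 Helm chart, with image version v0.5.0.
- Run the job on the rayproject/ray:2.4.0 image.
2. Download
Search for the KubeRay charts: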
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# helm search repo kuberay
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME                        CHART VERSION   APP VERSION   DESCRIPTION
kuberay/kuberay-apiserver   0.5.1                         A Helm chart for kuberay-apiserver
kuberay/kuberay-operator    0.5.1                         A Helm chart for Kubernetes
kuberay/ray-cluster         0.5.1                         A Helm chart for Kubernetes
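Download the charts (the kuberay-operator chart unpacked in the next step is fetched the same way, with helm fetch kuberay/kuberay-operator; that command is missing from the captured session):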
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# helm fetch kuberay/kuberay-apiserver
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# helm fetch kuberay/ray-cluster
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
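For a repeatable download, the chart version can be pinned instead of taking whatever is latest. A minimal sketch, assuming the repo alias kuberay points at https://ray-project.github.io/kuberay-helm/ (the session above does not show the helm repo add step); the helper name fetch_kuberay_charts is hypothetical:

```shell
# Hypothetical helper: fetch the two charts used in this post at a pinned version.
# Assumption: the "kuberay" repo alias maps to the URL below.
fetch_kuberay_charts() {
  version="${1:-0.5.1}"
  helm repo add kuberay https://ray-project.github.io/kuberay-helm/ || return 1
  helm repo update || return 1
  for chart in kuberay-operator kuberay-apiserver; do
    # helm fetch is an alias of helm pull; --version pins the chart version.
    helm fetch "kuberay/${chart}" --version "${version}" || return 1
  done
}

# Usage: fetch_kuberay_charts 0.5.1
```

Each fetched chart lands in the current directory as a .tgz archive, ready for the unpack step.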
Unpack:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# tar -xvf kuberay-operator-0.5.1.tgz
kuberay-operator/Chart.yaml
kuberay-operator/values.yaml
kuberay-operator/templates/_helpers.tpl
kuberay-operator/templates/deployment.yaml
kuberay-operator/templates/leader_election_role.yaml
kuberay-operator/templates/leader_election_role_binding.yaml
kuberay-operator/templates/ray_rayjob_editor_role.yaml
kuberay-operator/templates/ray_rayjob_viewer_role.yaml
kuberay-operator/templates/ray_rayservice_editor_role.yaml
kuberay-operator/templates/ray_rayservice_viewer_role.yaml
kuberay-operator/templates/role.yaml
kuberay-operator/templates/rolebinding.yaml
kuberay-operator/templates/service.yaml
kuberay-operator/templates/serviceaccount.yaml
kuberay-operator/.helmignore
kuberay-operator/README.md
kuberay-operator/crds/ray.io_rayclusters.yaml
kuberay-operator/crds/ray.io_rayjobs.yaml
kuberay-operator/crds/ray.io_rayservices.yaml
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# tar -xvf kuberay-apiserver-0.5.1.tgz
kuberay-apiserver/Chart.yaml
kuberay-apiserver/values.yaml
kuberay-apiserver/templates/_helpers.tpl
kuberay-apiserver/templates/deployment.yaml
kuberay-apiserver/templates/ingress.yaml
kuberay-apiserver/templates/role.yaml
kuberay-apiserver/templates/rolebinding.yaml
kuberay-apiserver/templates/service.yaml
kuberay-apiserver/templates/serviceaccount.yaml
kuberay-apiserver/.helmignore
kuberay-apiserver/README.md
3. Download the images
Pull the images and push them to Huawei Cloud SWR:
docker pull kuberay/operator:v0.5.0
docker tag kuberay/operator:v0.5.0 swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/kuberay/operator:v0.5.0
docker push swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/kuberay/operator:v0.5.0
docker pull kuberay/apiserver:v0.5.0
docker tag kuberay/apiserver:v0.5.0 swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/kuberay/apiserver:v0.5.0
docker push swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/kuberay/apiserver:v0.5.0
docker pull rayproject/ray:2.4.0
docker tag rayproject/ray:2.4.0 swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
docker push swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
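The same pull, tag, push sequence repeats for every image, so it can be generated in a loop. A minimal sketch: the helper name mirror_cmds is hypothetical, and it only prints the docker commands (a dry run) so they can be reviewed before anything is executed.

```shell
# Registry prefix taken from the commands above.
REGISTRY="swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto"

# Hypothetical helper: print the pull/tag/push commands for each image given.
mirror_cmds() {
  for image in "$@"; do
    echo "docker pull ${image}"
    echo "docker tag ${image} ${REGISTRY}/${image}"
    echo "docker push ${REGISTRY}/${image}"
  done
}

mirror_cmds kuberay/operator:v0.5.0 kuberay/apiserver:v0.5.0 rayproject/ray:2.4.0
```

Piping the output to sh would run the commands for real; that assumes docker login against SWR has already succeeded.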
4. Modify the Helm charts
Replace the image values in kuberay-operator's values.yaml so they point at the SWR copy:

Replace the image values in kuberay-apiserver's values.yaml in the same way:
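The exact edits depend on the chart, but the goal is to point each chart's image at the copy pushed to SWR in step 3. A minimal sketch that writes a Helm override file instead of editing values.yaml in place. It assumes both charts read image.repository and image.tag (check the unpacked values.yaml before relying on this); the override file names are hypothetical.

```shell
REGISTRY="swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto"

# Write one override file per chart, pointing the image at the SWR copy.
# Assumption: the charts read image.repository and image.tag.
for component in operator apiserver; do
  cat > "kuberay-${component}-override.yaml" <<EOF
image:
  repository: ${REGISTRY}/kuberay/${component}
  tag: v0.5.0
EOF
done

cat kuberay-operator-override.yaml
```

An override file like this could be passed at install time with the -f flag, for example helm install kuberay-operator kuberay-operator -f kuberay-operator-override.yaml, which leaves the unpacked chart untouched.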

5. Install the KubeRay components
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# helm install kuberay-operator kuberay-operator
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME: kuberay-operator
LAST DEPLOYED: Wed May 10 20:46:00 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# helm install kuberay-apiserver kuberay-apiserver
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME: kuberay-apiserver
LAST DEPLOYED: Wed May 10 20:46:46 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
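Log in to the Huawei Cloud CCE console and check that the KubeRay components were installed successfully.
Path: CCE > Workloads > kuberay-operator or kuberay-apiserver > Container Settings > Image Access Credential > default-secret > Submit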


If the image pull fails, configure default-secret as the image access credential:
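The same credential can also be attached from the command line rather than through the console. A minimal sketch, assuming the secret is named default-secret and that both workloads are Deployments in the current namespace; the helper name add_pull_secret is hypothetical.

```shell
# Hypothetical helper: attach an image pull secret to a Deployment's pod template.
# Note: a merge patch replaces any imagePullSecrets list already on the template.
add_pull_secret() {
  deployment="$1"
  secret="${2:-default-secret}"
  kubectl patch deployment "${deployment}" --type merge \
    -p "{\"spec\":{\"template\":{\"spec\":{\"imagePullSecrets\":[{\"name\":\"${secret}\"}]}}}}"
}

# Usage:
#   add_pull_secret kuberay-operator
#   add_pull_secret kuberay-apiserver
```

Patching the pod template triggers a rollout, so the pods are recreated and retry the image pull with the credential attached.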


6. Download the RayJob manifest
Source: https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml
Change 1: edit runtimeEnv to drop the pip dependency installation. The runtimeEnv field holds the base64 encoding of this JSON:
{
    "pip": [ ],
    "env_vars": {"counter_name": "test_counter"}
}
After base64 encoding:
ewogICAgInBpcCI6IFsgXSwKICAgICJlbnZfdmFycyI6IHsiY291bnRlcl9uYW1lIjogInRlc3RfY291bnRlciJ9Cn0K
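To double-check the encoded value, it can be decoded back and, going the other way, regenerated from the JSON. A minimal sketch, assuming GNU coreutils base64 (-d to decode, -w 0 to disable line wrapping):

```shell
ENCODED="ewogICAgInBpcCI6IFsgXSwKICAgICJlbnZfdmFycyI6IHsiY291bnRlcl9uYW1lIjogInRlc3RfY291bnRlciJ9Cn0K"

# Decode the value used in the RayJob manifest to inspect the JSON it carries.
printf '%s' "${ENCODED}" | base64 -d

# Re-encode the same JSON. The heredoc preserves the four-space indentation
# and the trailing newline, both of which are part of the encoded value.
REENCODED="$(base64 -w 0 <<'EOF'
{
    "pip": [ ],
    "env_vars": {"counter_name": "test_counter"}
}
EOF
)"

[ "${REENCODED}" = "${ENCODED}" ] && echo "round trip OK"
```

Whitespace matters here: encoding the same JSON with different indentation gives a different string, although it decodes to an equivalent runtime environment.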
Where to change it:

Change 2:
Add serviceType: ClusterIP

Change 3:
Add imagePullSecrets

Change 4: update the image address:
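Taken together, changes 1 to 4 might land in the manifest roughly as in the fragment below. This is a sketch, not the full sample file: it assumes the v1alpha1 RayJob layout (runtimeEnv at the top of spec, serviceType under headGroupSpec, imagePullSecrets in the pod template spec), a pull secret named default-secret, and a head container named ray-head. The worker groups would need the same image and imagePullSecrets changes. The fragment is written to a hypothetical scratch file so it can be compared with the downloaded sample.

```shell
# Write the sketch to a scratch file; it is a fragment, not an applicable manifest.
cat > rayjob-changes-fragment.yaml <<'EOF'
spec:
  # Change 1: base64-encoded runtimeEnv without pip packages
  runtimeEnv: ewogICAgInBpcCI6IFsgXSwKICAgICJlbnZfdmFycyI6IHsiY291bnRlcl9uYW1lIjogInRlc3RfY291bnRlciJ9Cn0K
  rayClusterSpec:
    headGroupSpec:
      # Change 2: expose the head service inside the cluster only
      serviceType: ClusterIP
      template:
        spec:
          # Change 3: pull secret (the name is an assumption)
          imagePullSecrets:
            - name: default-secret
          containers:
            - name: ray-head
              # Change 4: image mirrored to SWR in step 3
              image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
EOF

# List the image-related lines for a quick visual check.
grep -n "image" rayjob-changes-fragment.yaml
```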


7. Launch the RayJob
1) Submit the job
Once the RayJob is applied, a Ray cluster is started first and the job is then submitted to it.
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob# kubectl apply -f ray_v1alpha1_rayjob.yaml
rayjob.ray.io/rayjob-sample created
configmap/ray-job-code-sample created
2) Check the cluster
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# kubectl get rayclusters -o wide
NAME AGE
ray2-kuberay 49d
rayjob-sample-raycluster-rd84d 26m
3) Inspect the RayJob
kubectl describe rayjobs rayjob-sample
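kubectl describe prints the whole object; when only the job state matters, a JSONPath query is more compact. A minimal sketch, assuming the v1alpha1 RayJob status exposes jobDeploymentStatus, jobStatus and rayClusterName; the helper name rayjob_status is hypothetical.

```shell
# Hypothetical helper: print selected status fields of a RayJob, tab-separated.
# Assumption: the status carries jobDeploymentStatus, jobStatus, rayClusterName.
rayjob_status() {
  name="${1:-rayjob-sample}"
  kubectl get rayjob "${name}" \
    -o jsonpath='{.status.jobDeploymentStatus}{"\t"}{.status.jobStatus}{"\t"}{.status.rayClusterName}{"\n"}'
}

# Usage: rayjob_status rayjob-sample
```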

4) Open the Ray dashboard
Get the service name or address:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# kubectl get svc
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
kuberay-apiserver-service                 NodePort    10.247.162.162   <none>        8888:31888/TCP,8887:31887/TCP                   36m
kuberay-operator                          ClusterIP   10.247.236.242   <none>        8080/TCP                                        37m
kubernetes                                ClusterIP   10.247.0.1       <none>        443/TCP                                         56d
notebook-proxy                            NodePort    10.247.154.164   <none>        80:30528/TCP                                    49d
ray2-kuberay-head-svc                     ClusterIP   10.247.206.89    <none>        10001/TCP,6379/TCP,8265/TCP,8080/TCP,8000/TCP   49d
rayjob-sample-raycluster-rd84d-head-svc   ClusterIP   10.247.255.69    <none>        8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP   6s
Set up port forwarding for local access:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray# kubectl port-forward service/rayjob-sample-raycluster-rd84d-head-svc 8265:8265
Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Handling connection for 8265
Handling connection for 8265
Handling connection for 8265
Open http://127.0.0.1:8265 in a browser to reach the dashboard.
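The cluster name carries a random suffix (rd84d in this run), so the service name changes on every submission. A minimal sketch that looks the name up instead of hard-coding it, assuming status.rayClusterName is populated on the RayJob and that the head service is named after the cluster with a -head-svc suffix, as in the listing above; the helper name forward_dashboard is hypothetical.

```shell
# Hypothetical helper: forward the dashboard port of the cluster a RayJob created.
# Assumptions: status.rayClusterName is set, and the head service is named
# "<cluster>-head-svc".
forward_dashboard() {
  job="${1:-rayjob-sample}"
  cluster="$(kubectl get rayjob "${job}" -o jsonpath='{.status.rayClusterName}')"
  if [ -z "${cluster}" ]; then
    echo "RayJob ${job} has no cluster yet" >&2
    return 1
  fi
  kubectl port-forward "service/${cluster}-head-svc" 8265:8265
}

# Usage: forward_dashboard rayjob-sample
```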
View the cluster:

View the job:

View the job's run logs:

5) Delete the RayJob (optional):
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob# kubectl delete -f ray_v1alpha1_rayjob.yaml
rayjob.ray.io "rayjob-sample" deleted
configmap "ray-job-code-sample" deleted
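8. Additional notes
1) A Ray cluster really is started for the job

2) Automatic shutdown when the job finishes
Set this in the RayJob spec:
shutdownAfterJobFinishes: true

3) Timing analysis
Based on the output of kubectl describe rayjobs rayjob-sample:
the Ray cluster took roughly 10 seconds to start, the Ray job ran for about 9 seconds, and deleting the cluster took about 6 seconds.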
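These durations can be computed from the timestamps in the describe output rather than estimated by eye. A minimal sketch, assuming GNU date and ISO 8601 timestamps; the helper name elapsed_seconds is hypothetical, and the two sample timestamps are placeholders, not values from the run described here.

```shell
# Hypothetical helper: seconds elapsed between two ISO 8601 timestamps.
# Assumption: GNU date (-d parses the timestamp, +%s prints epoch seconds).
elapsed_seconds() {
  start="$(date -u -d "$1" +%s)"
  end="$(date -u -d "$2" +%s)"
  echo $((end - start))
}

# Placeholder timestamps for illustration only.
elapsed_seconds "2023-05-10T12:50:00Z" "2023-05-10T12:50:09Z"
```

On a live cluster the arguments could come from status fields such as startTime and endTime, assuming the RayJob status exposes them.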





