Spark on K8S (spark-on-kubernetes-operator) Environment Setup
Environment Requirements
Operator version: the latest is fine
Kubernetes version: 1.13 or later
Spark version: 2.4.4 or later; I used 2.4.4
Operator image: the latest is fine
How It Works
Spark is a compute engine; it only becomes useful when paired with a resource scheduler and a storage service. We have always run Yarn + HDFS, and recently we are considering Spark on K8S + HDFS, so essentially this is a swap of the resource-scheduling layer.
Yarn's logical scheduling unit is the Container, which falls short of K8S in resource management: it does not fully isolate the physical runtime environment of compute tasks (for example, when python/java/TensorFlow jobs are co-located). On K8S, containers give you isolated runtime environments on the same physical resources, so the hardware is fully utilized. A quick comparison of the two:
- job on pod vs. job on container;
- centralized scheduling (api-server) vs. two-level scheduling (ResourceManager/ApplicationMaster);
- containers share physical resources independently with isolated compute environments vs. Containers share physical resources independently (except disk and network IO) with coupled compute environments;
- elasticity at the IaaS layer vs. none
From the user's perspective, the two submission flows look like this:
Spark on Yarn
spark-submit ---- ResourceManager ---- ApplicationMaster(Container) ---- Driver(Container) ---- Executor(Container)
PS: in cluster mode the Driver starts in an arbitrary Container; in client mode it starts locally on the submitting machine.
Spark on Kubernetes
spark-submit ---- Kube-api-server(Pod) ---- Kube-scheduler(Pod) ---- Driver(Pod) ---- Executor(Pod)
PS: unlike a Deployment/StatefulSet, Spark has no custom controller of its own during scheduling and execution, so after submitting to the cluster you simply see the driver + executor pods, with no Deployment/StatefulSet-like controller managing them.
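For reference, the two native submit commands look roughly like this (hosts, registries, and versions in angle brackets are placeholders):
# Spark on Yarn, cluster mode
spark-submit \
  --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar

# Spark on Kubernetes without the operator, cluster mode
spark-submit \
  --master k8s://https://<api-server-host>:6443 --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<registry>/spark:v2.4.4 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar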
K8S 1.13 itself already accepts a direct submit; the operator is about making that easier for users. Through a CRD (Custom Resource Definition) it defines a SparkApplication type, so a user just asks K8S to create that resource object and the spark-submit step happens inside the operator. In my view the new CRD type alone is not a huge win, but the operator also brings its own scheduling, pod status monitoring, wrapping of the driver/executor pods into one SparkApplication that is monitored as a whole, retries of failed executors, and the like, which makes later development and maintenance easier; so I decided to try it.
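Once the operator is installed, a quick sanity check that the CRD is registered and that SparkApplication objects are queryable (standard kubectl; the CRD group is sparkoperator.k8s.io):
# The CRD should show up in the cluster's CRD list
kubectl get crd | grep sparkoperator.k8s.io
# SparkApplication objects can then be listed like any other resource
kubectl get sparkapplications -n default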
For an initial overview, these two resources are a good start:
https://www.slidestalk.com/AliSpark/MicrosoftPowerPoint55236?video
http://www.sohu.com/a/343653335_315839
Also worth sharing, an article on scheduler design:
https://io-meter.com/2018/02/09/A-summary-of-designing-schedulers/
On performance, others have published comparative tests:
https://xiaoxubeii.github.io/articles/practice-of-spark-on-kubernetes/
The official diagram explains it clearly:
- the submit step is pulled out and handled by sparkctl
- a controller is added, supporting the SparkApplication type
- scheduling is extracted and done by the operator itself
If interested, this brief introduction is worth a look: https://www.slideshare.net/databricks/apache-spark-on-k8s-best-practice-and-performance-in-the-cloud
In detail, the operator's workflow consists of the following steps:
- a request to create a SparkApplication is submitted to the api-server
- the SparkApplication object is persisted to etcd
- the operator, watching for SparkApplication objects, picks it up and runs the spark-submit step through its submission runner; the resulting request to the api-server creates the corresponding driver/executor pods
- the spark pod monitor tracks the application's execution state (which is what sparkctl list and status read from)
- the mutating admission webhook creates a svc so the Spark web UI can be reached (see the sketch below)
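A sketch of reaching the web UI; the <app-name>-ui-svc service name is the operator's convention as far as I can tell, so verify with kubectl get svc first:
# Find the UI service the webhook/controller created, then forward it locally
kubectl get svc
kubectl port-forward svc/spark-pi-ui-svc 4040:4040
# sparkctl also has a forward subcommand for the same purpose
sparkctl forward spark-pi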
With this approach, submitting a job becomes as simple as rolling out a Deployment/StatefulSet: an image plus a YAML. You can also use K8S resources such as ConfigMap, Secret, and volumes, which is far more convenient than running spark-submit yourself and passing everything via --conf or hard-coding it into the image (Spark 3.0 does add some configuration options, but those ultimately still have to be managed at submit time or image-build time).
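For example, a fragment of a SparkApplication spec that mounts a ConfigMap and a Secret into the driver, assuming the operator's v1beta2 configMaps/secrets fields (the names here are hypothetical; check the operator's API docs for your version):
spec:
  driver:
    configMaps:
      - name: app-config        # an existing ConfigMap
        path: /etc/app-config   # mount path inside the driver pod
    secrets:
      - name: app-secret        # an existing Secret
        path: /etc/app-secret
        secretType: Generic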
Environment Setup
Installing Kubernetes 1.13
This would just be a long step-by-step log, so rather than a separate post, it lives here:
https://blog.csdn.net/weixin_42305433/article/details/103931032
https://blog.csdn.net/weixin_42305433/article/details/103931045
Installing spark-on-kubernetes-operator
Also a step-by-step log, kept here:
https://blog.csdn.net/weixin_42305433/article/details/103930666
Demo
Preparing the spark-pi image
For practice I used the officially recommended spark-pi example. The jar ships inside the spark 2.4.4 image; later, wanting to test a few things locally, I found the example's source under the corresponding directory of the Spark GitHub repo. The official source is:
// scalastyle:off println
package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}
Package the code into a jar, then write a Dockerfile to build the corresponding image.
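A minimal build sketch, assuming a standard Maven project layout (the artifact name comes from my pom and is purely illustrative):
# Build the jar; the artifact lands in target/
mvn clean package
ls target/SparkPi-1.0-sleep-SNAPSHOT.jar
With the jar built, the Dockerfile is as follows: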
# The base image can be overridden at build time via --build-arg SPARK_IMAGE=...
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v2.4.4
FROM ${SPARK_IMAGE}
# COPY is preferred over ADD for plain local files
COPY ./SparkPi-1.0-sleep-SNAPSHOT.jar /go/SparkPi-1.0-sleep-SNAPSHOT.jar
# Not strictly required for a jar, but harmless
RUN chmod +x /go/SparkPi-1.0-sleep-SNAPSHOT.jar
Swap in your own jar name here; the "sleep" bit is just part of my artifact's name. Then build and push:
docker build -t spark-pi:<tag> .
docker tag spark-pi:<tag> <registry-host>:<port>/spark-pi:<tag>
docker push <registry-host>:<port>/spark-pi:<tag>
The image is built into the local registry first; you must push it to the private registry before other nodes can pull it. One note: keep the docker build context directory free of unrelated files, which speeds up image builds.
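If the private registry requires authentication, the cluster nodes also need a pull secret before they can pull the image; a sketch with placeholder credentials (and, as I understand the v1beta2 API, the SparkApplication spec can reference it through spec.imagePullSecrets):
# Create a docker-registry secret for the private registry
kubectl create secret docker-registry regcred \
  --docker-server=<registry-host>:<port> \
  --docker-username=<user> \
  --docker-password=<password>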
Creating the spark-pi job
With the image in place, you can submit the demo following the spark-on-kubernetes-operator docs. Submission is very simple:
sparkctl create spark-pi.yaml
With native spark-on-kubernetes there is no such step; you just run spark-submit directly. Here you instead write a SparkApplication CRD object and let spark-operator handle the management and submission of the Spark job. At this point the cluster shows:
[root@linux100-99-81-13 test]# kubectl get pod -n spark-operator
NAME READY STATUS RESTARTS AGE
littering-woodpecker-sparkoperator-9456c4d8-ktcdq 1/1 Running 0 45d
littering-woodpecker-sparkoperator-init-v7mw8 0/1 Completed 0 45d
The YAML of the CRD object to submit is as follows:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi
namespace: default
spec:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.4"
imagePullPolicy: IfNotPresent
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
sparkVersion: "2.4.4"
restartPolicy:
type: Never
volumes:
- name: "test-volume"
hostPath:
path: "/tmp"
type: Directory
driver:
nodeSelector:
type: drivers
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 2.4.4
serviceAccount: spark
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
executor:
nodeSelector:
type: executors
cores: 1
instances: 5
memory: "512m"
labels:
version: 2.4.4
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
After submission, the application can be inspected with either kubectl or sparkctl:
[root@linux100-99-81-13 test]# sparkctl event spark-pi
+------------+--------+----------------------------------------------------+
| TYPE | AGE | MESSAGE |
+------------+--------+----------------------------------------------------+
| Normal | 13s | SparkApplication spark-pi |
| | | was added, enqueuing it for |
| | | submission |
| Normal | 9s | SparkApplication spark-pi was |
| | | submitted successfully |
| Normal | 8s | Driver spark-pi-driver is |
| | | running |
| Normal | 0s | Executor |
| | | spark-pi-1578926994055-exec-1 |
| | | is pending |
| Normal | 0s | Executor |
| | | spark-pi-1578926994055-exec-2 |
| | | is pending |
| Normal | 0s | Executor |
| | | spark-pi-1578926994055-exec-3 |
| | | is pending |
| Normal | 0s | Executor |
| | | spark-pi-1578926994055-exec-4 |
| | | is pending |
| Normal | 0s | Executor |
| | | spark-pi-1578926994055-exec-5 |
| | | is pending |
+------------+--------+----------------------------------------------------+
At this point kubectl shows that the driver and each executor were each given a pod to run in:
[root@linux100-99-81-13 test]# kubectl get pod
NAME READY STATUS RESTARTS AGE
spark-pi-1578927078367-exec-1 1/1 Running 0 3s
spark-pi-1578927078367-exec-2 1/1 Running 0 3s
spark-pi-1578927078367-exec-3 1/1 Running 0 3s
spark-pi-1578927078367-exec-4 1/1 Running 0 3s
spark-pi-1578927078367-exec-5 1/1 Running 0 2s
spark-pi-driver 1/1 Running 0 13s
After the run completes, only the driver pod remains, in Completed state:
[root@linux100-99-81-13 test]# kubectl get pod
NAME READY STATUS RESTARTS AGE
spark-pi-driver 0/1 Completed 0 39s
Checking the status again with sparkctl now shows:
[root@linux100-99-81-13 test]# sparkctl status spark-pi
application state:
+-----------+----------------+----------------+-----------------+---------------------+--------------------+-------------------+
| STATE | SUBMISSION AGE | COMPLETION AGE | DRIVER POD | DRIVER UI | SUBMISSIONATTEMPTS | EXECUTIONATTEMPTS |
+-----------+----------------+----------------+-----------------+---------------------+--------------------+-------------------+
| COMPLETED | 1m | 46s | spark-pi-driver | 10.105.250.204:4040 | 1 | 1 |
+-----------+----------------+----------------+-----------------+---------------------+--------------------+-------------------+
executor state:
+-------------------------------+--------+
| EXECUTOR POD | STATE |
+-------------------------------+--------+
| spark-pi-1578927078367-exec-2 | FAILED |
| spark-pi-1578927078367-exec-3 | FAILED |
| spark-pi-1578927078367-exec-4 | FAILED |
| spark-pi-1578927078367-exec-5 | FAILED |
| spark-pi-1578927078367-exec-1 | FAILED |
+-------------------------------+--------+
The executors all show FAILED here, yet the driver log shows the pi computation completed normally:
[root@linux100-99-81-13 test]# sparkctl log spark-pi | tail -n 30
20/01/13 11:46:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 117.0 MB)
20/01/13 11:46:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 117.0 MB)
20/01/13 11:46:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi-1578915970250-driver-svc.default.svc:7079 (size: 1256.0 B, free: 117.0 MB)
20/01/13 11:46:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
20/01/13 11:46:26 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
20/01/13 11:46:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
20/01/13 11:46:26 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.44.0.7, executor 4, partition 0, PROCESS_LOCAL, 7885 bytes)
20/01/13 11:46:26 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.44.0.4, executor 1, partition 1, PROCESS_LOCAL, 7885 bytes)
20/01/13 11:46:27 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.44.0.7:35987 (size: 1256.0 B, free: 117.0 MB)
20/01/13 11:46:27 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.44.0.4:39765 (size: 1256.0 B, free: 117.0 MB)
20/01/13 11:46:27 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 773 ms on 10.44.0.4 (executor 1) (1/2)
20/01/13 11:46:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 797 ms on 10.44.0.7 (executor 4) (2/2)
20/01/13 11:46:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/01/13 11:46:27 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.214 s
20/01/13 11:46:27 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.378283 s
Pi is roughly 3.1415757078785393
20/01/13 11:46:27 INFO SparkUI: Stopped Spark web UI at http://spark-pi-1578915970250-driver-svc.default.svc:4040
20/01/13 11:46:27 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
20/01/13 11:46:27 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
20/01/13 11:46:27 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
20/01/13 11:46:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/01/13 11:46:27 INFO MemoryStore: MemoryStore cleared
20/01/13 11:46:27 INFO BlockManager: BlockManager stopped
20/01/13 11:46:27 INFO BlockManagerMaster: BlockManagerMaster stopped
20/01/13 11:46:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/01/13 11:46:27 INFO SparkContext: Successfully stopped SparkContext
20/01/13 11:46:27 INFO ShutdownHookManager: Shutdown hook called
20/01/13 11:46:27 INFO ShutdownHookManager: Deleting directory /var/data/spark-73fb44d0-9bb5-4dc7-bc2c-b33b2dddb989/spark-baf95523-bf0a-42f2-976f-599a4f044d6c
20/01/13 11:46:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-8b8888c1-888b-4be7-9468-6ba3f0f0e000
The abnormal executor state is something to mark and dig into later.
That completes a minimal demo. A few things to note once the computation finishes:
- the spark-pi SparkApplication does not disappear; sparkctl list still shows it
- the spark-pi driver pod is not removed automatically after completion; you have to delete it yourself
- creating from the same spark-pi.yaml again fails as long as the spark-pi SparkApplication has not been deleted (see the cleanup sketch below)
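To rerun the job, clean up first; a sketch using sparkctl or kubectl (deleting the SparkApplication should also remove the driver pod it owns):
# Delete via sparkctl
sparkctl delete spark-pi
# Or equivalently via kubectl
kubectl delete sparkapplication spark-pi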