Spark Core Source Code Study Notes
This series records a walkthrough of the Spark source code. The goal is to untangle how Spark distributes and runs a program, tracing the key code paths so that we understand not just what happens but why. Side branches are only described in prose, without drilling down, to keep the main thread clear.
In the previous articles we covered the registration and startup of the Master and Worker, of the Driver and Executor, and of the Application. We initialized the SparkContext, SchedulerBackend, and TaskScheduler, and finally allocated the hardware resources through the schedule() method. Everything is now in place except one thing: how does an application get divided into Stages, and how is each Stage dispatched to the Executors as concrete Tasks? Let's pick up at the end of the JavaWordCount application: output = counts.collect();
Starting from an Action Operator
Inside, collect() (like count() and every other action) ends up calling SparkContext's runJob method. Skipping the intermediate hops, here is count()'s one-liner and the runJob overload it finally lands in:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

// ... omitting the intermediate calls inside runJob; below is the overload that is finally invoked

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 * partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // So the job is ultimately handed to the dagScheduler; this is the focus of what follows
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  // Checkpoint, so that the RDD can be reused
  rdd.doCheckpoint()
}
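For comparison, collect(), the action that JavaWordCount actually ends with, funnels into the same runJob entry point. Here it is as it appears in RDD.scala (Spark 2.x; minor details vary across versions):

def collect(): Array[T] = withScope {
  // Run a job that materializes every partition, then concatenate the per-partition arrays
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}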
Now look at DAGScheduler's runJob method:
def runJob[T, U](...): Unit = {
  // Submit an action job to the scheduler.
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // Preferred alternative to `Await.ready()`
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
}
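The awaitReady call above blocks until the JobWaiter's completionFuture resolves. Conceptually, a JobWaiter is just a Promise plus a finished-task counter. A stripped-down sketch of the idea (MiniJobWaiter is a made-up name; the real JobWaiter also implements JobListener and supports cancellation):

import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Future, Promise}

// Sketch of the JobWaiter idea: every finished task invokes the result handler,
// and the promise completes once all partitions have reported back, which is
// exactly what runJob blocks on via awaitReady.
class MiniJobWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private val finishedTasks = new AtomicInteger(0)
  private val promise = Promise[Unit]()

  def completionFuture: Future[Unit] = promise.future

  def taskSucceeded(index: Int, result: T): Unit = {
    resultHandler(index, result)
    if (finishedTasks.incrementAndGet() == totalTasks) {
      promise.trySuccess(()) // unblocks the thread awaiting completionFuture
    }
  }
}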
Step into the submitJob method:
/**
 * Submit an action job to the scheduler.
 * @return a JobWaiter object that can be used to block until the job finishes executing
 *         or can be used to cancel the job.
 */
def submitJob[T, U](...): JobWaiter[U] = {
  val jobId = nextJobId.getAndIncrement()
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  // Instantiate a JobWaiter, whose members simply track the job's state
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Recall the eventProcessLoop mentioned when the DAGScheduler was instantiated: much like
  // the Dispatcher in the RPC layer, a loop thread drains messages from an eventQueue.
  // post() simply puts a JobSubmitted case class into that eventQueue for the loop thread to handle.
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
Recall the logic of eventProcessLoop: a loop thread keeps taking events off the eventQueue and handles each one with doOnReceive(event); see the previous article for the details.
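Since this queue-plus-loop-thread pattern recurs throughout Spark, here is a minimal sketch of it (MiniEventLoop is a simplified stand-in; the real org.apache.spark.util.EventLoop adds error handling and a graceful stop protocol):

import java.util.concurrent.LinkedBlockingDeque

// post() enqueues an event; a single daemon thread drains the queue and dispatches
// each event to onReceive, exactly the shape of Spark's eventProcessLoop.
abstract class MiniEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          onReceive(eventQueue.take()) // blocks until an event is posted
        }
      } catch {
        case _: InterruptedException => // interrupted by stop() while blocked on take()
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit
}

Back in the DAGScheduler, the loop thread dispatches each event through doOnReceive: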
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // ... other event types elided
We are about to step into the handleJobSubmitted method. Everything from here on is essential: Stage division, Task dispatch, and more.
Stage
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_], // the RDD on which the action (count) was triggered
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // Stage division starts here; there is a lot to it, so we expand it separately below
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    ...
  }
  // A thin wrapper that holds the finalStage created above
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // Clear the cached locations of persisted RDD partitions
  clearCacheLocs()
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  // Bookkeeping
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Formally submit the Stage
  submitStage(finalStage)
}
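Note the SparkListenerJobStart posted on the listenerBus: this is the same event stream the Spark UI consumes, and the public listener API lets us observe it ourselves. A minimal sketch (JobStartLogger is a made-up name; register it via sc.addSparkListener):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Observes the SparkListenerJobStart that handleJobSubmitted posts on the listenerBus
class JobStartLogger extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")
  }
}
// Usage: sc.addSparkListener(new JobStartLogger)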
Let's trace createResultStage and submitStage in turn:
/**
 * Create a ResultStage associated with the provided jobId.
 */
private def createResultStage(...): ResultStage = {
  // Obtain the ResultStage's parent stages; the nested loops inside are expanded below
  val parents = getOrCreateParentStages(rdd, jobId)
  // After the call above, all of this RDD's ancestors have been divided into Stages;
  // the one RDD that remains is wrapped into the ResultStage.
  // Grab an auto-incremented ID and instantiate the ResultStage, whose field binds it to
  // the current jobId; hence ResultStage and job correspond one to one.
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  // With all of the above done, return the ResultStage
  stage
}
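Before moving on, it helps to keep the Stage hierarchy in mind. A simplified model (MiniStage and friends are illustrative names only; the real classes in org.apache.spark.scheduler carry far more state): every job ends in exactly one ResultStage, and all of its ancestors are ShuffleMapStages.

// Simplified model of the Stage hierarchy, fields trimmed to those discussed here
sealed abstract class MiniStage(val id: Int, val parents: List[MiniStage])

// Intermediate stage: materializes shuffle output for downstream stages
class MiniShuffleMapStage(id: Int, parents: List[MiniStage])
  extends MiniStage(id, parents)

// Final stage of a job: computes the action's result; exactly one per job,
// hence the one-to-one ResultStage/job correspondence noted above
class MiniResultStage(id: Int, parents: List[MiniStage], val activeJobId: Int)
  extends MiniStage(id, parents)

Now getOrCreateParentStages, which produces those parent ShuffleMapStages: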
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  // Find this RDD's shuffle dependencies, then create (or fetch) a ShuffleMapStage for each
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
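To see the division concretely, here is a self-contained sketch (StageBoundaryDemo is illustrative; it assumes a local spark-core dependency on the classpath). A lineage with a single reduceByKey contains exactly one ShuffleDependency, so the ResultStage gets exactly one parent ShuffleMapStage:

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("stage-demo"))
    val counts = sc.parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _) // the ShuffleDependency: everything above becomes a ShuffleMapStage
    // toDebugString prints the lineage; the indentation step marks the shuffle boundary
    println(counts.toDebugString)
    counts.collect()      // the Action: runJob -> DAGScheduler -> two stages
    sc.stop()
  }
}

getShuffleDependencies is the helper that locates those shuffle boundaries by walking the RDD's dependency graph: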
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  // Container for the return value
  val parents = new HashSet[ShuffleDependency[_, _, _]]