0. Preface
The previous chapter walked through the reduceByKey method and showed that a transformation only records the operation in the RDD without actually executing it; execution is deferred until Spark reaches an action. Inside an action, runJob is called to submit the job to the DAGScheduler for scheduling. This chapter reads through the runJob method in detail.
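To make this laziness concrete before diving in, here is a minimal local sketch (a hypothetical driver program, not Spark source; the app name and master setting are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, numSlices = 4)
    val doubled = rdd.map(_ * 2)    // transformation: only recorded in the lineage, nothing runs
    val sum = doubled.reduce(_ + _) // action: calls sc.runJob, which triggers the actual job
    println(s"sum = $sum")          // 110
    sc.stop()
  }
}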
After reading through this chapter's source, we will be able to verify two things:
1. What exactly is the relationship between the number of tasks and the number of partitions?
2. What exactly is the basis for dividing stages?
1. The Preparation Phase
Taking reduce, one of the RDD's actions, as the example, let's walk in detail through how an action submits its job. Source first:
/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {  // ------ 1)
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {  // ------ 2)
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)  // ------ 3)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
The purpose of reduce itself needs no repetition here. The method first defines two functions. reducePartition iterates over iter and applies cleanF pairwise; since it uses reduceLeft, the accumulator is always the first argument: B2 = cleanF(A1, A2), B3 = cleanF(B2, A3), B4 = cleanF(B3, A4), ..., BN = cleanF(B(N-1), AN). (In a reduce, pairwise merging is exactly this.) The other function, mergeResult, merges the results of individual tasks into the final job result. (None of this is the main point; just remember roughly what each function does, as both are used below.)
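To see this combining order in action, here is a plain-Scala simulation of how reducePartition and mergeResult cooperate (no Spark involved; the partition contents below are made up for illustration):

object ReduceSimulation {
  def main(args: Array[String]): Unit = {
    val cleanF: (Int, Int) => Int = _ + _

    // reducePartition: fold one partition's iterator with reduceLeft;
    // the accumulator is always the FIRST argument to cleanF.
    val reducePartition: Iterator[Int] => Option[Int] = iter =>
      if (iter.hasNext) Some(iter.reduceLeft(cleanF)) else None

    // mergeResult: fold each task's result into the running job result.
    var jobResult: Option[Int] = None
    val mergeResult = (index: Int, taskResult: Option[Int]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(cleanF(value, taskResult.get))
          case None => taskResult
        }
      }
    }

    // Three fake "partitions", one of them empty.
    val partitions = Seq(Iterator(1, 2, 3), Iterator.empty, Iterator(4, 5))
    partitions.map(reducePartition).zipWithIndex.foreach {
      case (result, index) => mergeResult(index, result)
    }
    println(jobResult) // Some(15)
  }
}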
reduce then calls runJob on the SparkContext. Source:
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
SparkContext's runJob mostly does preparatory work: checking whether the user has already stopped the current SparkContext; capturing the user's call information (callSite holds the user's call-stack information; users can also set a function of their own to return something else); cleaning the closure passed in; and so on. It also handles some cleanup, such as finishing the stage progress bar and recording the checkpoint. SparkContext's runJob then delegates to dagScheduler.runJob, which is where the job actually gets submitted. (DAG here stands for directed acyclic graph: the dependencies between RDDs in Spark form a DAG, and the DAGScheduler analyzes this dependency graph to determine the execution order of stages; later articles will cover this in detail.)
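As an aside, the sc.runJob call inside reduce passed only three arguments and no partition list; in Spark 2.x it goes through an intermediate overload that expands the job to every partition of the RDD, which already foreshadows the answer to question 1 (one task per partition). Quoted from SparkContext (wording may differ slightly between versions):

/**
 * Run a job on all partitions in an RDD and pass the results to a handler function.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    processPartition: Iterator[T] => U,
    resultHandler: (Int, U) => Unit): Unit = {
  val processFunc = (context: TaskContext, iter: Iterator[T]) => processPartition(iter)
  runJob[T, U](rdd, processFunc, 0 until rdd.partitions.length, resultHandler)
}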
Next, let's look at DAGScheduler's runJob method:
/**
 * Run an action job on the given RDD and pass all the results to the resultHandler function as
 * they arrive.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @throws Exception when the job fails
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)  // ------ 1)
  // Note: Do not call Await.ready(future) because that calls `scala.concurrent.blocking`,
  // which causes concurrent SQL executions to fail if a fork-join pool is used. Note that
  // due to idiosyncrasies in Scala, `awaitPermission` is not actually used anywhere so it's
  // safe to pass in null here. For more detail, see SPARK-13747.
  val awaitPermission = null.asInstanceOf[scala.concurrent.CanAwait]
  waiter.completionFuture.ready(Duration.Inf)(awaitPermission)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
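As a closing note, the partitions parameter documented above is reachable from the user API as well: a job can compute just a subset of partitions, which is how operations like first() and take() avoid scanning the whole RDD. A hypothetical usage snippet (assumes a live sc, as in the earlier sketch):

// Launch tasks for partition 0 only and grab the first three elements of it.
val rdd = sc.parallelize(1 to 100, numSlices = 10)
val results: Array[Array[Int]] =
  sc.runJob(rdd, (iter: Iterator[Int]) => iter.take(3).toArray, Seq(0))
println(results.head.mkString(", ")) // elements from partition 0 only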