In the previous article, once the SparkContext finished initializing and the workers had registered back, the application code began to run. When an action is encountered, that action calls runJob, and runJob ultimately reaches the DAGScheduler. This article analyzes how the DAGScheduler splits a job into stages; each stage is further split into tasks, with one task launched per partition, and each task run at its preferred location. The resulting TaskSet is then handed to the TaskScheduler, which ships each task to the cluster for execution. Concretely, stages are split according to the dependencies between RDDs: all operations before an action form one DAG, and the scheduler walks it from back to front, cutting a new stage whenever it meets a wide dependency. A DAG with n wide dependencies is therefore split into n + 1 stages, which you can observe in the web UI on port 4040. A shuffle, for example, is a wide dependency. Now let's look at the source code: DAGScheduler.runJob calls submitJob.
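Before diving in, a minimal sketch of the behavior just described (the object name is illustrative): one action triggers one job, and the single shuffle introduced by reduceByKey splits that job into two stages, which you can confirm on port 4040 while the app runs.

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("StageSplitDemo").setMaster("local[2]"))
    val words = sc.parallelize(Seq("a", "b", "a", "c"))     // narrow lineage so far
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // wide dependency: stage boundary
    counts.collect()                                        // action: submits one job with two stages
    sc.stop()
  }
}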
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case JobFailed(exception: Exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
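For context on how an action reaches this point: RDD.count in the Spark source of this era is roughly the one-liner below. It wraps a per-partition function and hands it to SparkContext.runJob, which in turn delegates to the DAGScheduler.runJob shown above.

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum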
Tracing into the submitJob method:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
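Where does the posted JobSubmitted event go? Paraphrased from DAGSchedulerEventProcessLoop.doOnReceive in the same source tree: the event loop dispatches each event back into the DAGScheduler on its own thread, which is how handleJobSubmitted ends up being invoked.

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  // ... other events elided ...
}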
So the job ultimately lands in handleJobSubmitted: it first builds a stage from the final RDD, then creates a job and registers it in the scheduler's bookkeeping maps. Two methods do the heavy lifting along the way: getMissingParentStages and submitStage. (The listing below shows the closely related handleMapStageSubmitted, the handler for map-stage-only jobs; it follows the same pattern.)
private[scheduler] def handleMapStageSubmitted(jobId: Int,
    dependency: ShuffleDependency[_, _, _],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  // Submitting this map stage might still require the creation of some parent stages, so make
  // sure that happens.
  var finalStage: ShuffleMapStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Take the last RDD's shuffle dependency and build the finalStage from it.
    finalStage = getShuffleMapStage(dependency, jobId)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got map stage job %s (%s) with %d output partitions".format(
    jobId, callSite.shortForm, dependency.rdd.partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.addActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // This call causes the first (oldest ancestor) stage to be submitted
  // and parks the remaining stages in waitingStages.
  submitStage(finalStage)

  // If the whole stage has already finished, tell the listener and remove it
  if (finalStage.isAvailable) {
    markMapStageJobAsFinished(job, mapOutputTracker.getStatistics(dependency))
  }

  // Submit all the stages other than the first one, i.e. those parked in waitingStages.
  submitWaitingStages()
}
The submitStage method:
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Look up the missing parent stages.
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      // If there are no missing parents, this stage is ready: submit its tasks.
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        // Otherwise submit each parent stage first, recursing up the DAG.
        for (parent <- missing) {
          submitStage(parent)
        }
        // Park this stage in the waitingStages HashSet until its parents finish.
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
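To make the control flow concrete, here is a self-contained toy model (not Spark code; ToyStage and all names are mine) of the same parent-first recursion: a stage whose parents are unfinished is parked in a waiting set, and each missing parent is submitted recursively, so the oldest ancestor runs first.

import scala.collection.mutable

object ParentFirstDemo {
  case class ToyStage(id: Int, parents: List[ToyStage])

  private val finished = mutable.HashSet[Int]()
  private val waiting  = mutable.HashSet[ToyStage]()

  def submit(stage: ToyStage): Unit = {
    val missing = stage.parents.filterNot(p => finished(p.id)) // cf. getMissingParentStages
    if (missing.isEmpty) {
      println(s"running stage ${stage.id}")                    // cf. submitMissingTasks
      finished += stage.id
    } else {
      missing.foreach(submit)                                  // recurse into parents first
      waiting += stage                                         // cf. waitingStages += stage
    }
  }

  def main(args: Array[String]): Unit = {
    val s0 = ToyStage(0, Nil)
    val s1 = ToyStage(1, List(s0))
    val s2 = ToyStage(2, List(s1))
    submit(s2) // prints "running stage 0"; s1 and s2 wait until their parents complete
  }
}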
getMissingParentStages is the most important method for stage splitting. It starts from the stage's final RDD and walks through all of its dependencies: if a dependency is narrow, the parent RDD is pushed onto a stack for further traversal; if it is wide, a new stage is created. This shows the two levels of splitting: a job is cut at the boundary between actions and transformations, so each action produces one job; a stage is cut along dependencies, so each wide dependency produces a new stage.
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            // A wide (shuffle) dependency: create a new shuffle map stage.
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // A narrow dependency: push the parent RDD onto the stack and keep walking.
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
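Since ShuffleDependency and NarrowDependency are public developer APIs, you can classify an RDD's direct dependencies yourself, mirroring what visit() does above. A small sketch (describeDeps is a hypothetical helper of mine):

import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

// Print, for each direct dependency, whether it would start a new stage.
def describeDeps(rdd: RDD[_]): Unit =
  rdd.dependencies.foreach {
    case _: ShuffleDependency[_, _, _] =>
      println(s"$rdd <- wide dependency (new stage)")
    case n: NarrowDependency[_] =>
      println(s"$rdd <- narrow dependency (same stage), parent: ${n.rdd}")
  }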
To summarize stage splitting:
Step 1: take the final RDD as the finalStage.
Step 2: walk each RDD's parent dependencies: a wide dependency creates a new stage, while a narrow dependency pushes the parent RDD onto the stack for further traversal.
Step 3: recurse over the resulting stages, always submitting parent stages before their children.
For example, a chain containing a shuffle operation such as groupByKey, reduceByKey or countByKey involves roughly three RDDs, MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD. The edge from MapPartitionsRDD to ShuffledRDD is a wide dependency, so the chain is split into two stages: the first MapPartitionsRDD forms one stage, and the latter two RDDs fall into the other stage.
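You can inspect such a split yourself with toDebugString, which indents the lineage at each shuffle boundary (output abbreviated; the exact RDD chain varies across operations and Spark versions):

val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)
// (2) ShuffledRDD[2] at reduceByKey ...
//  +-(2) MapPartitionsRDD[1] at map ...
//     |  ParallelCollectionRDD[0] at parallelize ...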