DAGScheduler implements high-level, DAG-oriented scheduling: it divides the RDDs of a DAG among different Stages. DAGScheduler computes how the series of RDDs in a DAG should be split into Stages, builds the parent-child relationships between these Stages, and finally splits each Stage into Tasks by Partition, submitting them to the underlying TaskScheduler as a collection of Tasks (a TaskSet).
All components use DAGScheduler by posting DAGSchedulerEvents to it. DAGScheduler's internal DAGSchedulerEventProcessLoop processes these DAGSchedulerEvents and invokes the appropriate DAGScheduler methods. JobListener listens for the success or failure of each Task in a job, and JobWaiter implements JobListener to determine the final success or failure of the job. Before studying DAGScheduler itself, let's look at the components it depends on: DAGSchedulerEventProcessLoop, JobListener, and ActiveJob.
1 JobListener and JobWaiter
JobListener defines the interface contract for all Job listeners:
//org.apache.spark.scheduler.JobListener
private[spark] trait JobListener {
  def taskSucceeded(index: Int, result: Any): Unit
  def jobFailed(exception: Exception): Unit
}
The taskSucceeded method defined by JobListener is called each time a Task completes successfully, while jobFailed is called after the Job fails.
JobListener has two implementations: JobWaiter and ApproximateActionListener. JobWaiter waits for the entire Job to finish and then applies a given handler function to the returned results. ApproximateActionListener listens only for actions with a single return value (such as count() and non-parallel reduce()).
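To make the contract concrete, here is a minimal, hypothetical waiter in the spirit of JobWaiter: it implements the JobListener trait shown above, counts successful tasks, and completes a promise once every partition has reported. Everything except taskSucceeded and jobFailed (the class name, field names, the Promise-based completion) is illustrative, not Spark's actual implementation.

```scala
import scala.concurrent.{Future, Promise}

trait JobListener {
  def taskSucceeded(index: Int, result: Any): Unit
  def jobFailed(exception: Exception): Unit
}

// Hypothetical simplified waiter: totalTasks == 0 means the job is
// trivially successful, mirroring the zero-partition case of submitJob.
class SimpleJobWaiter(totalTasks: Int) extends JobListener {
  private var finishedTasks = 0
  private val jobPromise: Promise[Unit] =
    if (totalTasks == 0) Promise.successful(()) else Promise()

  def completionFuture: Future[Unit] = jobPromise.future

  override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    finishedTasks += 1
    if (finishedTasks == totalTasks) jobPromise.trySuccess(()) // all partitions done
  }

  override def jobFailed(exception: Exception): Unit =
    jobPromise.tryFailure(exception)
}
```

A caller would block on completionFuture, much as DAGScheduler.runJob blocks on JobWaiter's completionFuture.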
2 ActiveJob
ActiveJob represents an activated Job, that is, a Job that DAGScheduler has accepted for processing.
//org.apache.spark.scheduler.ActiveJob
private[spark] class ActiveJob(
    val jobId: Int,            // the Job's identifier
    val finalStage: Stage,     // the most downstream Stage of the Job
    val callSite: CallSite,    // the application's call site
    val listener: JobListener, // the JobListener watching this Job
    val properties: Properties) { // scheduling pool, job group, description, etc.

  // the number of partitions of this Job
  val numPartitions = finalStage match {
    case r: ResultStage => r.partitions.length
    case m: ShuffleMapStage => m.rdd.partitions.length
  }

  // one Boolean per partition index, recording whether that partition's task has finished
  val finished = Array.fill[Boolean](numPartitions)(false)

  // the number of this Job's tasks that have already finished
  var numFinished = 0
}
3 Components of DAGScheduler
DAGScheduler itself has many members, and knowing them is a prerequisite for understanding DAGScheduler in depth. Its members include the following:
- sc: the SparkContext
- taskScheduler: a reference to the TaskScheduler
- listenerBus: the LiveListenerBus
- mapOutputTracker: the MapOutputTrackerMaster
- blockManagerMaster: the BlockManagerMaster
- env: the SparkEnv
- clock: the clock object
- metricsSource: the metrics source for DAGScheduler (i.e., DAGSchedulerSource)
- nextJobId: an AtomicInteger that generates the identifier of the next Job
- numTotalJobs: the total number of submitted jobs; it actually reads the current value of nextJobId
- nextStageId: an AtomicInteger that generates the identifier of the next Stage
- jobIdToStageIds: caches the mapping between Job ids and Stage ids. Since its type is HashMap[Int, HashSet[Int]], the relationship between Jobs and Stages is one-to-many
- stageIdToStage: caches the mapping between Stage ids and Stages
- jobIdToActiveJob: the mapping between Job ids and activated Jobs (i.e., ActiveJob)
- waitingStages: the set of Stages in the waiting state
- runningStages: the set of Stages in the running state
- failedStages: the set of Stages in the failed state
- activeJobs: the set of all activated Jobs
- cacheLocs: caches the locations of all partitions of each RDD. Its type is HashMap[Int, IndexedSeq[Seq[TaskLocation]]], so an RDD's partitions are stored in the IndexedSeq indexed by partition number. Because each RDD partition is stored as a Block and the storage system may replicate it, a partition's Block may exist on the BlockManagers of multiple nodes; therefore each partition's location information is a sequence of TaskLocations
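The nested shape of cacheLocs can be sketched with a toy model: RDD id mapping to per-partition sequences of locations. Here TaskLocation is simplified to a host string, and the RDD id and hosts are made up for illustration; this is not Spark's code.

```scala
import scala.collection.mutable.HashMap

// RDD id -> (partition index -> locations of that partition's cached Block)
val cacheLocs = new HashMap[Int, IndexedSeq[Seq[String]]]

// Suppose a hypothetical RDD 5 has three partitions: partition 0 is
// replicated on two hosts, partition 1 lives on one host, and partition 2
// is not cached anywhere (an empty Seq, like Nil in getCacheLocs).
cacheLocs(5) = IndexedSeq(Seq("host1", "host2"), Seq("host3"), Nil)

// Looking up the locations of partition 0 of RDD 5:
val locsOfPartition0 = cacheLocs(5)(0)
```

Indexing by partition number into the IndexedSeq is what makes getCacheLocs and getPreferredLocsInternal cheap per-partition lookups.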
4 Common methods of DAGScheduler
4.1 clearCacheLocs
Clears the per-partition location information of every RDD cached in cacheLocs:
private def clearCacheLocs(): Unit = cacheLocs.synchronized {
  cacheLocs.clear()
}
4.2 updateJobIdStageIdMaps
Updates the mapping between a Job's identifier and a Stage together with all of its ancestors:
private def updateJobIdStageIdMaps(jobId: Int, stage: Stage): Unit = {
  @tailrec
  def updateJobIdStageIdMapsList(stages: List[Stage]) {
    if (stages.nonEmpty) {
      val s = stages.head
      s.jobIds += jobId
      jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += s.id
      val parents: List[Stage] = getParentStages(s.rdd, jobId)
      val parentsWithoutThisJobId = parents.filter { ! _.jobIds.contains(jobId) }
      updateJobIdStageIdMapsList(parentsWithoutThisJobId ++ stages.tail)
    }
  }
  updateJobIdStageIdMapsList(List(stage))
}
The processing steps are:
- 1) Add jobId to each Stage's jobIds
- 2) Record the mapping between jobId and each Stage's id in jobIdToStageIds
4.3 activeJobForStage
Finds the identifier of the earliest activated Job that uses the given Stage:
private def activeJobForStage(stage: Stage): Option[Int] = {
  val jobsThatUseStage: Array[Int] = stage.jobIds.toArray.sorted
  jobsThatUseStage.find(jobIdToActiveJob.contains)
}
4.4 getCacheLocs
Returns the TaskLocation sequence of each partition of an RDD:
private[scheduler]
def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {
  if (!cacheLocs.contains(rdd.id)) {
    val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
      IndexedSeq.fill(rdd.partitions.length)(Nil)
    } else {
      val blockIds =
        rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
      blockManagerMaster.getLocations(blockIds).map { bms =>
        bms.map(bm => TaskLocation(bm.host, bm.executorId))
      }
    }
    cacheLocs(rdd.id) = locs
  }
  cacheLocs(rdd.id)
}
The execution steps are:
- 1) If cacheLocs does not contain an IndexedSeq[Seq[TaskLocation]] for the RDD, then:
  ① If the RDD's storage level is NONE, build an empty IndexedSeq[Seq[TaskLocation]]
  ② Otherwise, build the RDDBlockId array for the RDD's partitions, call BlockManagerMaster's getLocations method to obtain, for each RDDBlockId, the sequence of storage locations (the BlockManager identifiers, BlockManagerId), and wrap each as a TaskLocation
  ③ Put the mapping from the RDD's id to the IndexedSeq[Seq[TaskLocation]] obtained in ① or ② into cacheLocs
- 2) Return the IndexedSeq[Seq[TaskLocation]] cached for the RDD in cacheLocs
4.5 getPreferredLocsInternal
Returns the preferred locations of the specified partition of an RDD:
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  if (!visited.add((rdd, partition))) { // avoid visiting the same partition of the RDD twice
    return Nil
  }
  val cached = getCacheLocs(rdd)(partition) // cached locations of the specified partition
  if (cached.nonEmpty) {
    return cached
  }
  // the RDD's own preferred locations for the specified partition
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  // use the preferred locations of the same partition of a narrowly
  // dependent parent RDD as this partition's preferred locations
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case _ =>
  }
  Nil // none of the steps above yielded a preferred location
}
4.6 handleExecutorAdded
Removes an Executor's identifier from failedEpoch:
private[scheduler] def handleExecutorAdded(execId: String, host: String) {
  if (failedEpoch.contains(execId)) {
    logInfo("Host added was in lost list earlier: " + host)
    failedEpoch -= execId
  }
  submitWaitingStages()
}
4.7 executorAdded
Posts an ExecutorAdded event to DAGScheduler's DAGSchedulerEventProcessLoop:
def executorAdded(execId: String, host: String): Unit = {
  eventProcessLoop.post(ExecutorAdded(execId, host))
}
5 DAGScheduler and Job submission
5.1 Submitting a Job
A user-submitted Job is first turned into a chain of RDDs and only then handed to DAGScheduler for processing. DAGScheduler's runJob method is the entry point of this process:
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  val awaitPermission = null.asInstanceOf[scala.concurrent.CanAwait]
  waiter.completionFuture.ready(Duration.Inf)(awaitPermission)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
- 1) Record the Job's start time in start
- 2) Call submitJob to submit the Job. Since Job execution is asynchronous, submitJob returns a JobWaiter object immediately
- 3) Use the JobWaiter to wait for the Job to finish. If the Job succeeds, log the outcome; if it fails, log it and additionally rethrow the exception that caused the failure
DAGScheduler's submitJob method submits the Job:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
- 1) Call the RDD's partitions method to obtain the Job's maximum number of partitions, maxPartitions. If a nonexistent partition is detected, throw an exception
- 2) Generate the jobId of the next Job
- 3) If the Job's number of partitions is zero, create and return a JobWaiter whose totalTasks is 0; per JobWaiter's implementation, its jobPromise is then immediately set to Success
- 4) If the Job's number of partitions is greater than zero, create a JobWaiter that actually waits for the Job to complete
- 5) Post a JobSubmitted event to eventProcessLoop (the DAGSchedulerEventProcessLoop)
- 6) Return the JobWaiter
5.2 Handling Job submission
Per DAGSchedulerEventProcessLoop's implementation, when it receives a JobSubmitted event it calls DAGScheduler's handleJobSubmitted method:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
  submitWaitingStages()
}
- 1) Call newResultStage to create the ResultStage. Stage creation may throw an exception (for example, when the underlying HDFS files that tasks on a HadoopRDD depend on have been deleted); when that happens, JobWaiter's jobFailed method must be invoked. Creating the ResultStage triggers a chain reaction that creates a series of Stages, and the relationships between the Job and those Stages are recorded in jobIdToStageIds
- 2) Create the ActiveJob
- 3) Call clearCacheLocs to empty cacheLocs
- 4) Record the Job's submission time
- 5) Put the mapping from jobId to the newly created ActiveJob into jobIdToActiveJob
- 6) Add the newly created ActiveJob to the activeJobs set
- 7) Let the ResultStage's _activeJob field hold the newly created ActiveJob
- 8) Collect the StageInfo of every Stage of the current Job (the stageInfos array)
- 9) Post a SparkListenerJobStart event to LiveListenerBus, causing all listeners interested in this event to react accordingly
- 10) Call submitStage to submit the ResultStage
6 Building Stages
In Spark, a Job may be divided into one or more Stages. The Stages depend on one another: a downstream Stage depends on its upstream Stages. Submitting all Stages of a Job involves backward driving and forward submission.
Backward driving of Stages means starting from the most downstream ResultStage: the ResultStage drives the execution of all of its parent Stages, those parents in turn drive their own parents, and this driving propagates toward the ancestors until the most upstream Stages are reached. Forward submission means that ancestor Stages submit their Tasks to the TaskScheduler before their descendants do: a grandparent Stage before a parent Stage, a parent Stage before a child Stage, and the ResultStage last of all.
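The two directions can be sketched on a toy stage graph (child to parents). "Driving" walks backward from the final stage through every ancestor; the order in which stages are collected is the forward submission order, with the ResultStage last. The graph and stage ids below are made up for illustration; this is not DAGScheduler's code.

```scala
// Hypothetical stage graph: stage id -> its parent stage ids.
// Stage 3 plays the role of the ResultStage; 0, 1, 2 are ShuffleMapStages.
val parentsOf = Map(3 -> List(1, 2), 1 -> List(0), 2 -> Nil, 0 -> Nil)

// Backward drive: depth-first from the final stage. A stage is appended
// only after all of its parents, yielding the forward submission order.
def submissionOrder(finalStage: Int): List[Int] = {
  var order = List.empty[Int]
  def visit(stage: Int): Unit = {
    parentsOf(stage).foreach(visit)              // drive the parents first
    if (!order.contains(stage)) order :+= stage  // then "submit" this stage
  }
  visit(finalStage)
  order
}
```

For the graph above, submissionOrder(3) places stage 3 (the ResultStage) last, after all of its ancestors.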
6.1 Creating the ResultStage
The newResultStage method creates the ResultStage:
private def newResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, jobId)
  val stage = new ResultStage(id, rdd, func, partitions, parentStages, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
- 1) Call getParentStagesAndId to obtain the list of all parent Stages; parent Stages are mainly the Stages corresponding to wide dependencies (ShuffleDependency)
- 2) Generate the Stage's identifier and create the ResultStage
- 3) Register the ResultStage in stageIdToStage
- 4) Call updateJobIdStageIdMaps to update the mapping between the Job's identifier and the ResultStage together with all of its ancestors
6.2 Getting or creating the list of parent Stages
A Spark Job may consist of one or more Stages, which are carved out starting from the ResultStage, working backward and creating them along the way. DAGScheduler's getParentStagesAndId method obtains all parent Stages of the given RDD; these Stages are assigned to the Job identified by the given job id.
private def getParentStagesAndId(rdd: RDD[_], firstJobId: Int): (List[Stage], Int) = {
  val parentStages = getParentStages(rdd, firstJobId)
  val id = nextStageId.getAndIncrement()
  (parentStages, id)
}
It calls getParentStages to obtain the parent Stages, generates an id, and returns both as a tuple.
private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  val parents = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  def visit(r: RDD[_]) {
    if (!visited(r)) {
      visited += r
      for (dep <- r.dependencies) {
        dep match {
          case shufDep: ShuffleDependency[_, _, _] =>
            parents += getShuffleMapStage(shufDep, firstJobId)
          case _ =>
            waitingForVisit.push(dep.rdd)
        }
      }
    }
  }
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  parents.toList
}
The getShuffleMapStage method gets or creates a ShuffleMapStage:
private def getShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) => stage
    case None =>
      getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        if (!shuffleToMapStage.contains(dep.shuffleId)) {
          shuffleToMapStage(dep.shuffleId) = newOrUsedShuffleStage(dep, firstJobId)
        }
      }
      val stage = newOrUsedShuffleStage(shuffleDep, firstJobId)
      shuffleToMapStage(shuffleDep.shuffleId) = stage
      stage
  }
}
- 1) If a ShuffleMapStage has already been created for the ShuffleDependency, return it directly
- 2) Otherwise, call getAncestorShuffleDependencies to find every ancestor ShuffleDependency for which no ShuffleMapStage has been created yet, and call newOrUsedShuffleStage to create and register one for each. Finally, call newOrUsedShuffleStage for the current ShuffleDependency as well and register the resulting ShuffleMapStage.
private def newOrUsedShuffleStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
- 1) Create the ShuffleMapStage, as follows:
  ① Take the ShuffleDependency's rdd field as the rdd of the ShuffleMapStage to be created
  ② Call the rdd's partitions method to obtain its partition array; the length of this array is the ShuffleMapStage's numTasks (number of Tasks). This shows that map tasks correspond one-to-one with the RDD's partitions
  ③ Call newShuffleMapStage to create the ShuffleMapStage together with all of its parent Stages (parents)
- 2) Call containsShuffle on mapOutputTracker (actually a MapOutputTrackerMaster) to check whether MapStatuses already exist for the shuffleId. If MapOutputTrackerMaster has no cached MapStatuses for the shuffleId, call MapOutputTrackerMaster's registerShuffle to register the shuffleId and its MapStatus mapping. If MapOutputTrackerMaster has cached MapStatuses for the shuffleId, then:
  ① Call MapOutputTrackerMaster's getSerializedMapOutputStatuses to obtain the serialized MapStatus byte array
  ② Call deserializeMapStatuses to deserialize the byte array from the previous step into a MapStatus array
  ③ Call the ShuffleMapStage's addOutputLoc method to update its outputLocs
7 Submitting the ResultStage
The final step of handling Job submission in handleJobSubmitted is calling submitStage to submit the ResultStage:
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage) // id of a Job that needs this Stage
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // the Stage has not been submitted yet
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) { // no unsubmitted parent Stages: submit this Stage's pending Tasks
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) { // submit the unsubmitted parent Stages first
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else { // abort all Jobs that depend on this Stage
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
The getMissingParentStages method obtains all unsubmitted parent Stages of the given Stage:
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
A parent Stage counts as unsubmitted when:
- 1) Some partition of the Stage's RDD has no corresponding TaskLocation sequence, i.e., getCacheLocs cannot find the TaskLocation sequence of a partition, which indicates that a partition task of some upstream ShuffleMapStage has not executed
- 2) The upstream ShuffleMapStage is not available
8 Submitting uncomputed Tasks
The entry point for submitting Tasks is the submitMissingTasks method, which submits the Stage's not-yet-submitted tasks when the Stage has no unavailable parent Stages:
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  stage.pendingPartitions.clear()
  // find the indices of the Stage's partitions whose computation has not finished
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // the ActiveJob's properties: scheduling pool, job group, description, etc.
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
  stage match { // start coordinating the commit of this Stage's output to HDFS
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match { // get the preferred locations of every partition that still needs computing
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch { // on any exception, record a new Stage attempt and abort the Stage
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // start a new execution attempt of the Stage
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    taskBinary = sc.broadcast(taskBinaryBytes) // broadcast the serialized task data
  } catch {
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage => // create one ShuffleMapTask per partition of the ShuffleMapStage
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
        }
      case stage: ResultStage => // create one ResultTask per partition of the ResultStage
        val job = stage.activeJob.get
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  if (tasks.size > 0) { // submit this batch of Tasks via TaskScheduler's submitTasks method
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else { // no Tasks were created, so mark the Stage as finished
    markStageAsFinished(stage, None)
    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage : ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
  }
}
9 DAGScheduler scheduling flow
Having studied DAGScheduler, its scheduling flow can be represented by the following diagram:
- Marker ①: through calls to the Spark API, the application performs a series of RDD transformations that build the dependency relationships between RDDs, then calls DAGScheduler's runJob method, handing the RDD and every RDD in its lineage to DAGScheduler for scheduling
- Marker ②: DAGScheduler's runJob method actually calls its submitJob method, which sends a JobSubmitted event to DAGSchedulerEventProcessLoop. On receiving the JobSubmitted event, DAGSchedulerEventProcessLoop puts it into the event queue (eventQueue)
- Marker ③: DAGSchedulerEventProcessLoop's internal polling thread eventThread continuously takes DAGSchedulerEvents from the event queue (eventQueue) and calls DAGSchedulerEventProcessLoop's doOnReceive method to handle them
- Marker ④: when doOnReceive handles a JobSubmitted event, it calls DAGScheduler's handleJobSubmitted method, which builds the Stages for the RDD and the dependencies between those Stages
- Marker ⑤: DAGScheduler first submits the Task set of the most upstream Stage to TaskScheduler, then step by step submits the Task sets of the downstream Stages; TaskScheduler schedules these Task sets
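The post/poll pattern behind markers ② and ③ can be sketched with a stripped-down event loop: producers post events onto a blocking queue while a single loop thread takes them off and dispatches them. The event types, handler wiring, and class name below are simplified stand-ins, not DAGSchedulerEventProcessLoop's actual code.

```scala
import java.util.concurrent.LinkedBlockingQueue

sealed trait Event
case class JobSubmitted(jobId: Int) extends Event
case object Stop extends Event // illustrative shutdown signal

class ToyEventLoop(handle: Event => Unit) {
  private val eventQueue = new LinkedBlockingQueue[Event]()
  private val eventThread = new Thread { // plays the role of eventThread
    override def run(): Unit = {
      var running = true
      while (running) {
        eventQueue.take() match {   // block until an event arrives
          case Stop => running = false
          case e    => handle(e)    // plays the role of doOnReceive
        }
      }
    }
  }
  eventThread.setDaemon(true)
  eventThread.start()

  def post(e: Event): Unit = eventQueue.put(e) // like eventProcessLoop.post
  def stop(): Unit = { eventQueue.put(Stop); eventThread.join() }
}
```

Decoupling producers from the handler through the queue is what lets submitJob return immediately with a JobWaiter while the actual Stage construction happens on the loop thread.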
10 Handling Task execution results
DAGScheduler's taskEnded method processes the result of a Task's execution. For a ShuffleMapTask, its status (a MapStatus) must be appended to the ShuffleMapStage's outputLocs cache; once the ShuffleMapTasks of all partitions of the ShuffleMapStage have succeeded, all MapStatuses in the outputLocs cache are registered in MapOutputTrackerMaster's mapStatuses so that tasks of downstream Stages can locate their input data, and the execution of downstream Stages is woken up. For a ResultTask, once every ResultTask of the ResultStage has succeeded, the ResultStage is marked successful and the JobWaiter is notified to collect the results of the individual ResultTasks and process them as the application requires (for example, printing to the console or writing to HDFS).
def taskEnded(
    task: Task[_],
    reason: TaskEndReason,
    result: Any,
    accumUpdates: Seq[AccumulatorV2[_, _]],
    taskInfo: TaskInfo): Unit = {
  eventProcessLoop.post(
    CompletionEvent(task, reason, result, accumUpdates, taskInfo))
}
The taskEnded method posts a CompletionEvent to DAGSchedulerEventProcessLoop. On receiving the CompletionEvent, DAGSchedulerEventProcessLoop calls DAGScheduler's handleTaskCompletion method, which implements the handling for the different execution states.
10.1 Handling ResultTask results
The handleTaskCompletion method processes a completed ResultTask as follows:
//org.apache.spark.scheduler.DAGScheduler
val stage = stageIdToStage(task.stageId)
event.reason match {
  case Success =>
    stage.pendingPartitions -= task.partitionId
    task match {
      case rt: ResultTask[_, _] =>
        val resultStage = stage.asInstanceOf[ResultStage]
        resultStage.activeJob match {
          case Some(job) =>
            if (!job.finished(rt.outputId)) {
              updateAccumulators(event)
              job.finished(rt.outputId) = true // mark the partition's task as finished
              job.numFinished += 1             // increment the ActiveJob's finished-task count
              if (job.numFinished == job.numPartitions) { // tasks of all partitions have finished
                markStageAsFinished(resultStage)          // mark the Stage as finished
                cleanupStateForJobAndIndependentStages(job)
                // post a SparkListenerJobEnd event to LiveListenerBus
                listenerBus.post(
                  SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
              }
              try {
                // the JobWaiter's resultHandler function processes each Task's result
                job.listener.taskSucceeded(rt.outputId, event.result)
              } catch {
                case e: Exception =>
                  // let the JobWaiter handle the failure
                  job.listener.jobFailed(new SparkDriverExecutionException(e))
              }
            }
          case None =>
            logInfo("Ignoring result from " + rt + " because its job has finished")
        }
10.2 Handling ShuffleMapTask results
The handleTaskCompletion method processes a completed ShuffleMapTask as follows:
case smt: ShuffleMapTask =>
  val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
  updateAccumulators(event)
  val status = event.result.asInstanceOf[MapStatus]
  val execId = status.location.executorId
  logDebug("ShuffleMapTask finished on " + execId)
  if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
    logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
  } else { // append the partition task's MapStatus to the Stage's outputLocs
    shuffleStage.addOutputLoc(smt.partitionId, status)
  }
  if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
    markStageAsFinished(shuffleStage) // mark the ShuffleMapStage as finished
    logInfo("looking for newly runnable stages")
    logInfo("running: " + runningStages)
    logInfo("waiting: " + waitingStages)
    logInfo("failed: " + failedStages)
    // register the Stage's shuffleId and the MapStatuses in outputLocs
    // with MapOutputTrackerMaster's mapStatuses
    mapOutputTracker.registerMapOutputs(
      shuffleStage.shuffleDep.shuffleId,
      shuffleStage.outputLocInMapOutputTrackerFormat(),
      changeEpoch = true)
    clearCacheLocs()
    if (!shuffleStage.isAvailable) {
      logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
        ") because some of its tasks had failed: " +
        shuffleStage.findMissingPartitions().mkString(", "))
      submitStage(shuffleStage) // some tasks failed: resubmit this ShuffleMapStage
    } else {
      if (shuffleStage.mapStageJobs.nonEmpty) {
        val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
        for (job <- shuffleStage.mapStageJobs) {
          // mark the map-stage Jobs of this ShuffleMapStage as finished
          markMapStageJobAsFinished(job, stats)
        }
      }
    }
  }
- 1) Append the Task's partitionId and MapStatus to the Stage's outputLocs
- 2) If the ShuffleMapStage has no partitions left awaiting computation, call DAGScheduler's markStageAsFinished to mark the ShuffleMapStage as finished, then call MapOutputTrackerMaster's registerMapOutputs to register the Stage's shuffleId and the MapStatuses in outputLocs with MapOutputTrackerMaster's mapStatuses
- 3) If the ShuffleMapStage's isAvailable method returns true (that is, _numAvailableOutputs equals numPartitions), all tasks have succeeded; call MapOutputTracker's getStatistics to obtain statistics on the output Block sizes of the shuffle dependency's map tasks, mark each ActiveJob held in the ShuffleMapStage's _mapStageJobs field as successful, and finally call submitWaitingChildStages to submit the ShuffleMapStage's child Stages.
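The availability check in step 3 can be illustrated with a toy model: a ShuffleMapStage is "available" once every partition has a registered map output. The class and field names mirror the description above but are made up for illustration; this is not Spark's ShuffleMapStage.

```scala
// Hypothetical simplified stage: MapStatus is reduced to a host string.
class ToyShuffleMapStage(val numPartitions: Int) {
  // one slot per partition, like the outputLocs cache described above
  private val outputLocs = Array.fill[Option[String]](numPartitions)(None)

  def addOutputLoc(partition: Int, status: String): Unit =
    outputLocs(partition) = Some(status)

  // counterpart of _numAvailableOutputs
  def numAvailableOutputs: Int = outputLocs.count(_.isDefined)

  // available once every partition has a map output registered
  def isAvailable: Boolean = numAvailableOutputs == numPartitions

  // partitions that still need to run, driving resubmission on failure
  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filter(outputLocs(_).isEmpty)
}
```

When a task fails, its slot stays empty, isAvailable returns false, and handleTaskCompletion resubmits the stage for exactly the partitions that findMissingPartitions reports.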