Spark Job Scheduling

1. Creating the finalStage

The finalStage is a ResultStage built by backtracking through the RDD dependency graph; it holds references to its parent stages. The traversal walks backwards from the final RDD using an explicit stack (to avoid StackOverflowError from deep recursion). Whenever a ShuffleDependency is encountered, a parent ShuffleMapStage is created (or looked up) and a stage boundary is cut there; narrow dependencies are folded into the current stage, and their parent RDDs are pushed onto the stack (visited last-in, first-out) until the traversal reaches the initial RDD.
var finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  val parents = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(r: RDD[_]) {
    if (!visited(r)) {
      visited += r
      // Kind of ugly: need to register RDDs with the cache here since
      // we can't do it in its constructor because # of partitions is unknown
      // Walk the dependencies of the current RDD
      for (dep <- r.dependencies) {
        dep match {
          // A ShuffleDependency marks a stage boundary: register the parent stage
          case shufDep: ShuffleDependency[_, _, _] =>
            parents += getShuffleMapStage(shufDep, firstJobId)
          case _ =>
            // Narrow dependency: push the parent RDD to be visited later (LIFO)
            waitingForVisit.push(dep.rdd)
        }
      }
    }
  }
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  parents.toList
}
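To make the stage-cutting rule concrete, consider a word count: every narrow transformation up to reduceByKey stays in one ShuffleMapStage, and the ShuffleDependency introduced by reduceByKey cuts the boundary before the final ResultStage. A minimal, illustrative sketch (the app name, master setting, and input path are assumptions, not part of the source above):

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stage-boundary-demo").setMaster("local[2]"))
    val counts = sc.textFile("input.txt") // assumed path
      .flatMap(_.split("\\s+"))           // narrow dependency: same stage
      .map((_, 1))                        // narrow dependency: same stage
      .reduceByKey(_ + _)                 // ShuffleDependency: new stage boundary
    counts.collect()                      // triggers the job; the ResultStage runs last
    sc.stop()
  }
}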
2. Creating an ActiveJob from the finalStage
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
3. Posting the job-start event to the ListenerBus
listenerBus.post(
  SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
submitStage(finalStage)
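Any SparkListener registered with the context receives the SparkListenerJobStart event posted above. As a hedged sketch (the listener class name is hypothetical, but SparkListener and onJobStart are the real hooks):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Hypothetical listener that logs each job-start event
class JobStartLogger extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")
  }
}

// Register it before running jobs: sc.addSparkListener(new JobStartLogger)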
4. Submitting the finalStage
private def submitStage(stage: Stage) {
  // Look up an active job that needs this stage (as in the original source)
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    // Find the missing parent stages, sorted by id
    val missing = getMissingParentStages(stage).sortBy(_.id)
    if (missing.isEmpty) {
      // No missing parents: launch this stage's tasks right away
      submitMissingTasks(stage, jobId.get)
    } else {
      // Recursively submit each missing parent first
      for (parent <- missing) {
        submitStage(parent)
      }
      // Queue this stage until its parents complete
      waitingStages += stage
    }
  }
}
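The parent-first recursion reproduces the stage graph you can inspect yourself: RDD.toDebugString prints the lineage with its shuffle boundaries, which predicts where submitStage will find missing parents. An illustrative snippet (assumes an existing SparkContext sc):

val pairs = sc.parallelize(1 to 100, 4).map(n => (n % 10, n))
val summed = pairs.reduceByKey(_ + _)
// Indentation changes in the output mark ShuffleDependency (stage) boundaries
println(summed.toDebugString)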
4.1 Computing task locality preferences
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
  stage match {
    case s: ShuffleMapStage =>
      partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
    case s: ResultStage =>
      val job = s.activeJob.get
      partitionsToCompute.map { id =>
        val p = s.partitions(id)
        (id, getPreferredLocs(stage.rdd, p))
      }.toMap
  }
} catch {
  case NonFatal(e) =>
    stage.makeNewStageAttempt(partitionsToCompute.size)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
    abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
    runningStages -= stage
    return
}
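getPreferredLocs ultimately consults the RDD's own placement hints (cached block locations, input splits, custom preferences). The same information is reachable from user code; a small sketch (assumes an existing SparkContext sc and an illustrative input path):

val rdd = sc.textFile("input.txt") // assumed path
rdd.partitions.foreach { part =>
  // preferredLocations returns the hosts where this partition's data lives
  println(s"partition ${part.index} -> ${rdd.preferredLocations(part).mkString(", ")}")
}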
4.2 Serializing the RDD and the stage's dependency/function, then broadcasting
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] = stage match {
    case stage: ShuffleMapStage =>
      closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
    case stage: ResultStage =>
      closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
  }

  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString, Some(e))
    runningStages -= stage

    // Abort execution
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
    runningStages -= stage
    return
}
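Broadcasting the serialized (rdd, dep/func) pair means each executor downloads the task binary once instead of once per task. User code can use the same broadcast mechanism for shared read-only data; a minimal sketch (assumes an existing SparkContext sc):

// Ship a lookup table to executors once, not inside every task closure
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
val labeled = sc.parallelize(Seq(1, 2, 3, 4))
  .map(n => lookup.value.getOrElse(n, "unknown"))
println(labeled.collect().mkString(", "))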
4.3 Creating tasks based on the stage type
val tasks: Seq[Task[_]] = try {
  stage match {
    case stage: ShuffleMapStage =>
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = stage.rdd.partitions(id)
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, stage.internalAccumulators)
      }

    case stage: ResultStage =>
      val job = stage.activeJob.get
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, id, stage.internalAccumulators)
      }
  }
} catch {
  case NonFatal(e) =>
    abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
    runningStages -= stage
    return
}
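Note the asymmetry: a ShuffleMapStage creates one ShuffleMapTask per partition to compute, while a ResultStage creates ResultTasks only for the partitions its job actually needs. Actions like first() rely on this; an illustrative snippet (assumes an existing SparkContext sc):

val data = sc.parallelize(1 to 1000, 10)
data.collect() // needs all 10 partitions: 10 ResultTasks
data.first()   // initially needs only partition 0: a single ResultTask to start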
4.4 Submitting the TaskSet to the TaskScheduler
taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
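From this point the TaskSchedulerImpl wraps the TaskSet in a TaskSetManager and offers it resources. How concurrent TaskSets share executors is governed by spark.scheduler.mode (FIFO by default; FAIR is optional). An illustrative configuration (app name and master are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")  // assumed name
  .setMaster("local[4]")               // assumed master
  .set("spark.scheduler.mode", "FAIR") // let concurrent TaskSets share cores
val sc = new SparkContext(conf)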




