Spark Source Code Study (2): How DAGScheduler Divides a Job into Stages and Submits Them

beTree_fc

Article link: https://blog.csdn.net/u013560925/article/details/79645507

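Background

To understand how the DAGScheduler submits a job, we first need to know what a job is and what a stage is. Suppose we write a program that calls several Spark operators. Nothing is actually computed until an action operator is reached, and the computation that the action triggers is a job. In other words, each action submits one job. For example, both collect and first call SparkContext.runJob to submit a job. The code for collect is: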

def collect(): Array[T] = withScope {
  // Calls SparkContext.runJob, which submits the job
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
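The point that transformations only record work while each action submits one job can be illustrated with a small toy model. This is a sketch in plain Scala with no Spark dependency: ToyRDD, parallelize, and the jobsRun counter are hypothetical names invented for the illustration and are not Spark's API.

```scala
// Toy model, NOT Spark's API: transformations only record work,
// while each action runs the recorded work as one "job".
object LazyJobDemo {
  // Counts how many times an action has triggered a job (illustrative only).
  var jobsRun = 0

  // Hypothetical stand-in for an RDD: `compute` is the deferred computation.
  final class ToyRDD[T](compute: () => Iterator[T]) {
    // Transformations: build a new ToyRDD and compute nothing yet.
    def map[U](f: T => U): ToyRDD[U] = new ToyRDD[U](() => compute().map(f))
    def filter(p: T => Boolean): ToyRDD[T] = new ToyRDD[T](() => compute().filter(p))

    // Actions: each call counts as one job and runs the whole chain.
    def collect(): List[T] = { jobsRun += 1; compute().toList }
    def first(): T = { jobsRun += 1; compute().next() }
  }

  def parallelize[T](data: Seq[T]): ToyRDD[T] = new ToyRDD[T](() => data.iterator)

  def main(args: Array[String]): Unit = {
    val rdd = parallelize(1 to 5).map(_ * 2).filter(_ > 4)
    println(s"jobs after transformations only: $jobsRun")
    println(rdd.collect())
    println(rdd.first())
    println(s"jobs after two actions: $jobsRun")
  }
}
```

In this model the counter stays at zero until collect or first is called, and every action re-runs the recorded chain, which mirrors why an uncached RDD is recomputed by each action.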

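How a job is divided into stages depends on the type of dependency between RDDs: wide (shuffle) or narrow. A stage boundary is placed at each wide dependency, while RDDs connected by narrow dependencies stay in the same stage. A triggered job usually contains one or more stages, and each of these stages carries the jobId generated by the driver. The source of SparkContext.runJob is walked through below.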

For an introduction to Application, Driver, Job, Task, and Stage, see this article:

https://www.cnblogs.com/superhedantou/p/5699201.html
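The rule that wide dependencies cut a job into stages can be sketched on a toy lineage graph. This is a minimal illustration under stated assumptions: Node, Narrow, Shuffle, splitStage, and stages are hypothetical names, not Spark's classes, and the traversal ignores details the real scheduler handles (for example, an RDD reachable by several paths is not de-duplicated here).

```scala
// Toy lineage model, NOT Spark's classes: shows how shuffle (wide)
// dependencies act as stage boundaries when walking back from the final RDD.
object StageSplitDemo {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep   // stays in the same stage
  final case class Shuffle(parent: Node) extends Dep  // starts a new parent stage
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Returns the RDD names in the same stage as `root` (upstream first) and
  // the parent RDDs that sit on the far side of a shuffle boundary.
  def splitStage(root: Node): (List[String], List[Node]) = {
    var members = List.empty[String]
    var boundaries = List.empty[Node]
    def visit(n: Node): Unit = {
      members = n.name :: members
      n.deps.foreach {
        case Narrow(p)  => visit(p)
        case Shuffle(p) => boundaries = p :: boundaries
      }
    }
    visit(root)
    (members, boundaries)
  }

  // All stages of the job, final stage first, then its parent stages.
  def stages(finalRdd: Node): List[List[String]] = {
    val (members, parents) = splitStage(finalRdd)
    members :: parents.flatMap(stages)
  }

  def main(args: Array[String]): Unit = {
    // A word-count-like lineage: three narrow steps, then one shuffle.
    val textFile = Node("textFile")
    val flatMap = Node("flatMap", List(Narrow(textFile)))
    val map = Node("map", List(Narrow(flatMap)))
    val reduceByKey = Node("reduceByKey", List(Shuffle(map)))
    stages(reduceByKey).foreach(s => println(s.mkString(" -> ")))
  }
}
```

The walk starts from the final RDD and moves backwards, which matches the direction in which a job is described as being split: the action's RDD defines the last stage, and each shuffle found on the way back opens a parent stage.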

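Process

1. The SparkContext.runJob method

This method mainly delegates to DAGScheduler.runJob, handing the work over to the DAGScheduler, which performs the upper, stage-level part of task scheduling. Its main steps are:
(1) Clean the closure (a link to an introductory article is given at the end of this post)
(2) dagScheduler.runJob
(3) rdd.doCheckpoint()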

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    // Core call: hand the job over to the DAGScheduler
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())

    // After the job finishes, checkpoint the RDD (only takes effect if the RDD was marked for checkpointing)
    rdd.doCheckpoint()
  }
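The role of the resultHandler parameter, and how collect ends up with one array per partition, can be sketched without Spark. In the sketch below, runJobWithHandler, runJobToArray, and the in-memory partitionsData are hypothetical stand-ins; everything runs on the calling thread, there is no TaskContext, and no closure cleaning or checkpointing is modelled.

```scala
import scala.reflect.ClassTag

// Toy model, NOT Spark's API: shows how a per-partition result handler
// can be used to assemble the Array that an action such as collect returns.
object ResultHandlerDemo {
  // Runs `func` on each requested partition and reports
  // (index within `partitions`, result) through `resultHandler`.
  def runJobWithHandler[T, U](
      partitionsData: IndexedSeq[Seq[T]],
      func: Iterator[T] => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    partitions.zipWithIndex.foreach { case (partitionId, outputIndex) =>
      resultHandler(outputIndex, func(partitionsData(partitionId).iterator))
    }
  }

  // Convenience wrapper: run on all partitions and gather results into an Array.
  def runJobToArray[T, U: ClassTag](
      partitionsData: IndexedSeq[Seq[T]],
      func: Iterator[T] => U): Array[U] = {
    val results = new Array[U](partitionsData.size)
    runJobWithHandler[T, U](partitionsData, func, partitionsData.indices,
      (index, res) => results(index) = res)
    results
  }

  // Toy collect: one array per partition, concatenated in partition order.
  def collect(partitionsData: IndexedSeq[Seq[Int]]): Array[Int] = {
    val perPartition = runJobToArray(partitionsData, (it: Iterator[Int]) => it.toArray)
    Array.concat(perPartition: _*)
  }

  def main(args: Array[String]): Unit = {
    val data = IndexedSeq(Seq(1, 2), Seq(3), Seq(4, 5, 6))
    println(runJobToArray(data, (it: Iterator[Int]) => it.sum).mkString(", "))
    println(collect(data).mkString(", "))
  }
}
```

Writing each result into a slot chosen by its index means the final array is in partition order regardless of the order in which results arrive, which is the property a handler-based design gives a real, concurrent scheduler.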

2. The dagScheduler.runJob method

This method calls submitJob to submit the job, then blocks until the job completes and handles the result.
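The submit-then-wait pattern can be sketched with only the standard library: submitting posts an event to a queue consumed by a separate loop thread, and the caller blocks on a future until that thread reports completion. ToyScheduler and this JobSubmitted case class are hypothetical simplifications, and scala.concurrent.Await.ready is used here in place of Spark's internal ThreadUtils.awaitReady helper.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration.Duration
import scala.util.{Failure, Success}
import scala.util.control.NonFatal

// Toy model, NOT Spark's implementation: submitJob only enqueues an event
// for a local event-loop thread; runJob blocks until that job completes.
object EventLoopDemo {
  // Hypothetical event type; a real scheduler event carries many more fields.
  final case class JobSubmitted(jobId: Int, work: () => Unit, done: Promise[Unit])

  final class ToyScheduler {
    private val queue = new LinkedBlockingQueue[JobSubmitted]()
    private var nextJobId = 0

    // Single consumer thread playing the role of the scheduler's event loop.
    private val loop = new Thread(new Runnable {
      def run(): Unit = while (true) {
        val event = queue.take()
        try { event.work(); event.done.success(()) }
        catch { case NonFatal(e) => event.done.failure(e) }
      }
    }, "toy-event-loop")
    loop.setDaemon(true)
    loop.start()

    // Returns immediately with the job id and a future to wait on.
    def submitJob(work: () => Unit): (Int, Future[Unit]) = synchronized {
      val jobId = nextJobId
      nextJobId += 1
      val done = Promise[Unit]()
      queue.put(JobSubmitted(jobId, work, done))
      (jobId, done.future)
    }

    // Blocks until the job finishes, logs the outcome, rethrows on failure.
    def runJob(work: () => Unit): Int = {
      val start = System.nanoTime
      val (jobId, future) = submitJob(work)
      Await.ready(future, Duration.Inf)
      val seconds = (System.nanoTime - start) / 1e9
      future.value.get match {
        case Success(_) => println(f"Job $jobId finished, took $seconds%f s")
        case Failure(e) => println(f"Job $jobId failed, took $seconds%f s"); throw e
      }
      jobId
    }
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new ToyScheduler
    scheduler.runJob(() => println("running on " + Thread.currentThread.getName))
  }
}
```

Keeping submission asynchronous and layering the blocking wait on top means the same submit path can serve both blocking and non-blocking callers. The method's actual source, as given in the article, follows.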

  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    
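    // Submit the job; what is actually submitted is the RDD and the func to run on it
    // submitJob posts to a separate event-loop thread; nothing is sent over the network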
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    
    // Handle the result
    ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
    waiter.completionFuture.value.get match {
      case scala.util.Success(_) =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case scala.util.Failure(exception) =>
        logInfo("Job %d failed: