Source Code: DAGScheduler, Stage Division and Submission

Latest recommended article published on 2021-06-29 22:19:28

技术蚂蚁

Categories: Spark, Spark source code

Article link: https://blog.csdn.net/u011007180/article/details/52403716

DAGScheduler

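DAGScheduler's main job is to build a DAG of Stages and to decide the preferred location for each task.

Keeps track of which RDDs or Stage outputs have been materialized
Acts as the stage-oriented scheduling layer: it builds a DAG of stages for each job and submits TaskSets to the TaskScheduler for execution
Resubmits stages whose shuffle output has been lost

Each Stage consists of independent tasks that all run the same compute function and share the same shuffle dependencies. When the DAG is cut into stages, the cuts fall at shuffle boundaries.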
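
The shuffle-boundary rule can be illustrated with a toy lineage. This is only a sketch under stated assumptions: Node, Narrow, Shuffle and the helper functions are hypothetical names invented for illustration, not Spark classes, and the lineage is assumed to be a simple chain or tree.

```scala
// Toy model, not Spark's API: every name below is hypothetical.
object StageCutSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep   // e.g. map, filter
  final case class Shuffle(parent: Node) extends Dep  // e.g. reduceByKey

  final case class Node(name: String, deps: List[Dep] = Nil)

  /** Nodes reachable from `last` through narrow dependencies only: one stage. */
  def stageOf(last: Node): List[Node] =
    last :: last.deps.collect { case Narrow(p) => p }.flatMap(stageOf)

  /** Nodes just behind a shuffle boundary; each one ends a parent stage. */
  def parentStageRoots(last: Node): List[Node] =
    stageOf(last).flatMap(_.deps.collect { case Shuffle(p) => p })

  /** All stages needed to compute `last`, parent stages first. */
  def allStages(last: Node): List[List[String]] =
    parentStageRoots(last).flatMap(allStages) :+ stageOf(last).map(_.name).reverse
}
```

For a chain textFile -> map -> reduceByKey -> filter in which only reduceByKey introduces a shuffle dependency, allStages returns two stages: (textFile, map) and (reduceByKey, filter).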

Instantiating DAGScheduler

The following code shows how SparkContext instantiates DAGScheduler:

  @volatile private[spark] var dagScheduler: DAGScheduler = _
  try {
    dagScheduler = new DAGScheduler(this)
  } catch {
    case e: Exception => {
      try {
        stop()
      } finally {
        throw new SparkException("Error while constructing DAGScheduler", e)
      }
    }
  }

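The constructor definitions below show that a DAGScheduler is created bound to a TaskScheduler. The auxiliary constructors call the primary constructor, filling in the remaining arguments from fields of sc: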

private[spark]
class DAGScheduler(
    private[scheduler] val sc: SparkContext,
    private[scheduler] val taskScheduler: TaskScheduler,
    listenerBus: LiveListenerBus,
    mapOutputTracker: MapOutputTrackerMaster,
    blockManagerMaster: BlockManagerMaster,
    env: SparkEnv,
    clock: Clock = new SystemClock())
  extends Logging {

  def this(sc: SparkContext, taskScheduler: TaskScheduler) = {
    this(
      sc,
      taskScheduler,
      sc.listenerBus,
      sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster],
      sc.env.blockManager.master,
      sc.env)
  }

  def this(sc: SparkContext) = this(sc, sc.taskScheduler)
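
The constructor chaining used here is an ordinary Scala pattern and can be tried in isolation. The sketch below uses hypothetical stand-in classes (Context and Scheduler, with plain strings in place of the real collaborators); it is not the Spark code, only the same shape.

```scala
// Hypothetical stand-ins: Context plays the role of SparkContext,
// Scheduler plays the role of DAGScheduler.
object ConstructorChainSketch {
  final class Context(val bus: String, val scheduler: String)

  class Scheduler(val ctx: Context, val taskScheduler: String, val bus: String) {
    // Auxiliary constructor: fills `bus` in from the context.
    def this(ctx: Context, taskScheduler: String) = this(ctx, taskScheduler, ctx.bus)

    // Auxiliary constructor: also takes the task scheduler from the context.
    // It may only call a constructor defined before it.
    def this(ctx: Context) = this(ctx, ctx.scheduler)
  }
}
```

With this shape, new Scheduler(ctx) is enough in normal use, while the full primary constructor stays available for callers, such as tests, that want to pass in their own collaborators.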

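Job submission and DAGScheduler operations

Most actions submit a job. In the 1.0 source, the call chain for job submission is roughly: sc.runJob() -> dagScheduler.runJob -> dagScheduler.submitJob -> dagSchedulerEventProcessActor.JobSubmitted -> dagScheduler.handleJobSubmitted -> dagScheduler.submitStage -> dagScheduler.submitMissingTasks -> taskScheduler.submitTasks.
The function calls made while a job is submitted and executed are, in order:

sc.runJob -> dagScheduler.runJob -> submitJob
DAGScheduler::submitJob creates a JobSubmitted event and sends it to the nested eventProcessActor (in the 1.4 source, submitJob uses the DAGSchedulerEventProcessLoop class to process events instead)
On receiving JobSubmitted, eventProcessActor calls the processEvent handler
The job is converted into stages: finalStage is generated and submitted for execution, and the key call is submitStage
submitStage works out the dependencies between stages; dependencies are either wide or narrow
If the current stage turns out to have no dependencies, or all of its dependencies are already available, its tasks are submitted
Tasks are submitted by calling submitMissingTasks
Which worker a task actually runs on is managed by the TaskScheduler, so submitMissingTasks calls TaskScheduler::submitTasks
TaskSchedulerImpl works with a backend chosen according to Spark's current run mode; for a single-machine run this is a LocalBackend
LocalBackend receives the ReviveOffers event passed in from TaskSchedulerImpl
reviveOffers -> executor.launchTask -> TaskRunner.run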

The runJob function of DAGScheduler

DAGScheduler.runJob ultimately delivers the results through resultHandler.
Here DAGScheduler's runJob calls DAGScheduler's submitJob to submit the job:

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
    waiter.awaitResult() match {
      case JobSucceeded => {
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      }
      case JobFailed(exception: Exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        throw exception
    }
  }
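
The submit-then-block pattern in runJob can be sketched without Spark. The Waiter class below is a hypothetical stand-in for JobWaiter built on a CountDownLatch, and a plain thread stands in for the scheduler; this illustrates the idea only and is not how Spark implements JobWaiter.

```scala
import java.util.concurrent.CountDownLatch

// Hypothetical sketch: Waiter and runJob are illustrative names, not Spark's classes.
object JobWaiterSketch {
  /** Passes each task result to resultHandler and unblocks once all tasks are done. */
  final class Waiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
    private val remaining = new CountDownLatch(totalTasks)

    def taskSucceeded(index: Int, result: U): Unit = {
      resultHandler(index, result) // hand this partition's result to the caller
      remaining.countDown()
    }

    def awaitResult(): Unit = remaining.await()
  }

  def runJob(partitions: Seq[Int])(compute: Int => Int): Array[Int] = {
    val results = new Array[Int](partitions.size)
    val waiter =
      new Waiter[Int](partitions.size, (i: Int, r: Int) => { results(i) = r })
    // Stand-in for the scheduler: one thread that completes the tasks in order.
    val worker = new Thread(new Runnable {
      def run(): Unit =
        partitions.zipWithIndex.foreach { case (p, i) =>
          waiter.taskSucceeded(i, compute(p))
        }
    })
    worker.start()
    waiter.awaitResult() // block the caller until every task has reported
    results
  }
}
```

The caller only sees a blocking call that returns once every partition has reported, which is the role waiter.awaitResult() plays in the excerpt above.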

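Scheduling job submission

In the Spark 1.4.0 source, DAGScheduler's submitJob no longer uses DAGSchedulerEventProcessActor for event handling and message passing. Instead it posts the JobSubmitted event through eventProcessLoop, an instance of the DAGSchedulerEventProcessLoop class.
The code of submitJob is shown below: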

  /**
   * Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
   * can be used to block until the job finishes executing or can be used to cancel the job.
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
    waiter
  }

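After the JobSubmitted event has been posted to eventProcessLoop, the eventThread inside that object processes it: the thread keeps taking events off the event queue and passing them to onReceive. When an event matches JobSubmitted, onReceive calls DAGScheduler's handleJobSubmitted with the jobId, rdd and other parameters to process the job.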
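
The post/take/onReceive cycle can be sketched as a small single-threaded event loop. This is a minimal illustration under stated assumptions: EventLoop, Event, Stop and stopAndJoin are hypothetical names, and the real DAGSchedulerEventProcessLoop handles many more event types as well as error handling.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical sketch of an event loop; not Spark's implementation.
object EventLoopSketch {
  sealed trait Event
  final case class JobSubmitted(jobId: Int) extends Event
  case object Stop extends Event

  final class EventLoop(onReceive: Event => Unit) {
    private val queue = new LinkedBlockingQueue[Event]()

    // The single consumer thread: takes events in posting order.
    private val thread = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          queue.take() match { // blocks until an event is available
            case Stop  => running = false
            case event => onReceive(event)
          }
        }
      }
    })
    thread.setDaemon(true)

    def start(): Unit = thread.start()
    def post(event: Event): Unit = queue.put(event)
    def stopAndJoin(): Unit = { post(Stop); thread.join() }
  }
}
```

Because a single thread drains the queue, handlers run one at a time and in posting order, so the handler code does not need its own locking for state that only it touches.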

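The handleJobSubmitted function

handleJobSubmitted is a key step in job processing. It analyzes the RDD dependencies, generates finalStage, and creates an ActiveJob from finalStage.
Some comments have been added to the handleJobSubmitted source below: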

  private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      allowLocal: Boolean,
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    var finalStage: Stage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
    } catch {
      // Error handling: tell the listener that the job failed, then return.
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    if (finalStage != null) {
      val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
      clearCacheLocs()
      logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
        job.jobId, callSite.shortForm, partitions.length, allowLocal))
      logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
      logInfo("Parents of final stage: " + finalStage.parents)
      logInfo("Missing parents: " + getMissingParentStages(finalStage))
      val shouldRunLocally =
        localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
      val jobSubmissionTime = clock.getTimeMillis()
      if (shouldRunLocally) {
        // Compute very short actions like first() or take() with no parent stages locally.
        listenerBus.post(
          SparkListenerJobStart(job.jobId, jobSubmissionTime, Seq.empty, properties))
        runLocally(job)
      } else {
        // Actions such as collect take this path: update the bookkeeping maps,
        // notify the listener bus, then submit the job.
        jobIdToActiveJob(jobId) = job
        activeJobs += job
        finalStage.resultOfJob = Some(job)
        val stageIds = jobIdToStageIds(jobId).toArray
        val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
        listenerBus.post(
          SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
        // Submit the final stage.
        submitStage(finalStage)
      }
    }
    // Submit any stages that are still waiting.
    submitWaitingStages()
  }

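org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted first creates finalStage from the RDD. As the name suggests, finalStage is the last Stage. It then creates the job and finally submits it. A submitted job runs locally if it meets all of the following conditions:

1) spark.localExecution.enabled is set to true
2) the user program explicitly allows local execution
3) finalStage has no parent Stage
4) there is only one partition

Conditions 3) and 4) are mainly there so that the job can finish quickly: with multiple stages or multiple partitions, running locally could slow the job down because of the limited compute resources of the local machine.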
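
The four conditions combine into a single boolean test, mirroring the shouldRunLocally expression in handleJobSubmitted. The standalone function below is a sketch: its name and parameter list are hypothetical, and stage and partition counts are passed as plain integers instead of being read from Spark objects.

```scala
// Hypothetical standalone version of the local-execution check.
object LocalRunSketch {
  def shouldRunLocally(
      localExecutionEnabled: Boolean, // 1) spark.localExecution.enabled
      allowLocal: Boolean,            // 2) the caller permits local execution
      numParentStages: Int,           // 3) must be 0
      numPartitions: Int              // 4) must be 1
  ): Boolean =
    localExecutionEnabled && allowLocal && numParentStages == 0 && numPartitions == 1
}
```

A job over a single partition with no parent stage passes the check only when both flags are also true; any job that needs a shuffle has at least one parent stage and is therefore always submitted through the normal path.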

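To understand what a Stage is, first be clear about what a Task is. A Task is the basic unit of execution on the cluster. One Task processes one partition of an RDD, so the partitions of an RDD are processed by different Tasks, all of which run exactly the same logic. Such a group of Tasks makes up a Stage. There are two kinds of Task: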

org.apache.spark.scheduler.ShuffleMapTask
org.apache.spark.scheduler.ResultTask

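A ShuffleMapTask places its results into different buckets according to the Task's partitioner, whereas a ResultTask sends its results back to the driver application. A Job contains one or more Stages, and a Stage is made up of a group of identical Tasks. The last Stage consists of a group of ResultTasks.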
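
How a partitioner assigns records to buckets can be shown with a hash-based sketch. It assumes hash partitioning with a non-negative modulo; bucketFor and shuffleWrite are hypothetical helper names, and the real ShuffleMapTask writes its output through the shuffle machinery instead of returning a Map.

```scala
// Hypothetical sketch of bucketing records by key; not Spark's shuffle code.
object BucketSketch {
  /** Non-negative modulo, so negative hash codes still map to a valid bucket. */
  def bucketFor(key: Any, numBuckets: Int): Int = {
    val raw = key.hashCode % numBuckets
    if (raw < 0) raw + numBuckets else raw
  }

  /** Groups the records of one map-side partition by their target bucket. */
  def shuffleWrite[K, V](records: Seq[(K, V)], numBuckets: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => bucketFor(k, numBuckets) }
}
```

Records with the same key always land in the same bucket, whichever map-side task produced them, which is what lets a downstream task read one bucket and see every value for its keys.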

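After the user triggers an action such as count or collect, SparkContext starts submitting the job through runJob. The request eventually passes through the DAG event processor to DAGScheduler's own handleJobSubmitted, which first divides the job into Stages, then submits the Stages, and then submits the Tasks. At that point the Tasks start running on the cluster.

A Stage begins by reading data from external storage or from shuffle output; a Stage ends when a shuffle occurs or when the result is produced.

Creating finalStage

handleJobSubmitted creates finalStage by calling newStage: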

finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)

A result stage, that is, finalStage, is created by calling org.apache.spark.scheduler.DAGScheduler#newStage, whereas a shuffle stage is created by calling org.apache.spark.scheduler.DAGScheduler#newOrUsedStage (called newOrUsedShuffleStage in 1.6).

  private def newStage( // called newResultStage in 1.6
      rdd: RDD[_],
      numTasks: Int,
      shuffleDep: Option[ShuffleDependency[_, _, _]],
      jobId: Int,
      callSite: CallSite)
    : Stage =
  {
    val id = nextStageId.getAndIncrement()
    val stage =
      new Stage(id, rdd, numTasks, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)
    stageIdToStage(id) = stage
    updateJobIdStageIdMaps(jobId, stage)
    stage
  }
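
The bookkeeping done by newStage (allocate an id, register the stage, link it to its job) can be sketched on its own. StageRegistry and its simplified Stage case class are hypothetical; parent-stage discovery (getParentStages) is left out entirely, and the job-to-stage map is updated directly instead of through updateJobIdStageIdMaps.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

// Hypothetical sketch of stage registration; not Spark's DAGScheduler.
object StageRegistrySketch {
  final case class Stage(id: Int, name: String, jobId: Int)

  final class StageRegistry {
    private val nextStageId = new AtomicInteger(0)
    val stageIdToStage = mutable.HashMap.empty[Int, Stage]
    val jobIdToStageIds = mutable.HashMap.empty[Int, mutable.HashSet[Int]]

    def newStage(name: String, jobId: Int): Stage = {
      val id = nextStageId.getAndIncrement() // unique, monotonically increasing id
      val stage = Stage(id, name, jobId)
      stageIdToStage(id) = stage
      // Remember that this job depends on the new stage.
      jobIdToStageIds.getOrElseUpdate(jobId, mutable.HashSet.empty[Int]) += id
      stage
    }
  }
}
```

Keeping both maps makes it possible to go from a stage id to its Stage and from a job id to all of its stage ids, which is the lookup handleJobSubmitted performs through jobIdToStageIds and stageIdToStage when it builds stageInfos.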