DAGScheduler is the high-level scheduler that implements stage-oriented scheduling. It cuts each job into stages at shuffle boundaries, computes a DAG of stages for the job, keeps track of which RDDs and stage outputs have been materialized, and finds a minimal schedule for running the job. Within each stage it then builds a set of tasks, wraps them in a TaskSet, uses the current cache status to choose a preferred location for each task (so a task runs on the node that holds its data), and hands the TaskSet to the TaskScheduler via submitTasks. It also handles stage failures caused by lost shuffle output by resubmitting the earlier stages. The scaladoc and declaration from the source:
/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster.
*
* In addition to coming up with a DAG of stages, this class also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*
*/
private[spark]
class DAGScheduler(
private[scheduler] val sc: SparkContext,
private[scheduler] val taskScheduler: TaskScheduler,
listenerBus: LiveListenerBus,
mapOutputTracker: MapOutputTrackerMaster,
blockManagerMaster: BlockManagerMaster,
env: SparkEnv,
clock: Clock = SystemClock)
extends Logging
When SparkContext is initialized, a series of things are configured, for example the memory settings are checked (this section focuses on the source code and implementation of the DAGScheduler, so the process of setting up conf and reading configuration is not covered here). SparkContext then creates and starts the scheduler:
// Create and start the scheduler
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
dagScheduler = new DAGScheduler(this)
} catch {
case e: Exception => throw
new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
}
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
When SparkContext creates and starts the scheduler, it first builds a TaskScheduler instance based on the master URL (the TaskScheduler is not started at this point, because the DAGScheduler's primary constructor has not yet run and set the DAGScheduler reference on it), then sets up the heartbeat between the driver and the executors, and finally creates the DAGScheduler instance with the following line:
dagScheduler = new DAGScheduler(this)
You may have noticed this line just before the dagScheduler object is instantiated:
@volatile private[spark] var dagScheduler: DAGScheduler = _
The @volatile annotation marks dagScheduler as a volatile field: a write made by one thread becomes immediately visible to every other thread that reads the field (a small standalone sketch of this guarantee follows). Continuing into DAGScheduler's primary constructor: it creates dagSchedulerActorSupervisor, an instance of the DAGSchedulerActorSupervisor actor that supervises the DAGScheduler's event-processing actor; the events handled by that child actor include job submission and cancellation as well as stage and task cancellation. The constructor then calls initializeEventProcessActor to set up eventProcessActor, the actor responsible for receiving and dispatching the DAGScheduler's event messages. The constructor code is shown right after the sketch.
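A minimal, standalone sketch of the visibility guarantee that @volatile provides (this is not Spark code; all names are made up):
// Standalone sketch of @volatile visibility, not Spark code.
object VolatileDemo extends App {
  @volatile var ready = false   // volatile: a write by one thread is published to all others
  var payload = 0

  new Thread(new Runnable {
    def run(): Unit = {
      payload = 42
      ready = true              // the volatile write also publishes the earlier write to payload
    }
  }).start()

  while (!ready) {}             // without @volatile this loop might never observe the update
  println(payload)              // prints 42
}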
private val dagSchedulerActorSupervisor =
env.actorSystem.actorOf(Props(new DAGSchedulerActorSupervisor(this)))
private[scheduler] var eventProcessActor: ActorRef = _
private def initializeEventProcessActor() {
// blocking the thread until supervisor is started, which ensures eventProcessActor is
// not null before any job is submitted
implicit val timeout = Timeout(30 seconds)
val initEventActorReply =
dagSchedulerActorSupervisor ? Props(new DAGSchedulerEventProcessActor(this))
eventProcessActor = Await.result(initEventActorReply, timeout.duration).
asInstanceOf[ActorRef]
}
initializeEventProcessActor()
Next, in SparkContext's runJob method the dagScheduler object calls DAGScheduler's runJob method (SparkContext.runJob is the main entry point for all actions in Spark):
/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark. The allowLocal
* flag specifies whether the scheduler can run the computation on the driver rather than
* shipping it out to the cluster, for short actions like first().
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
allowLocal: Boolean,
resultHandler: (Int, U) => Unit) {
if (dagScheduler == null) {
throw new SparkException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
val start = System.nanoTime
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
resultHandler, localProperties.get)
logInfo(
"Job finished: " + callSite.shortForm + ", took " + (System.nanoTime - start) / 1e9 + " s")
rdd.doCheckpoint()
}
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
allowLocal: Boolean,
resultHandler: (Int, U) => Unit,
properties: Properties = null)
{
val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
waiter.awaitResult() match {
case JobSucceeded => {}
case JobFailed(exception: Exception) =>
logInfo("Failed to run " + callSite.shortForm)
throw exception
}
}
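For context, here is a hypothetical driver program; every action in it (collect, count, first, ...) funnels into one of the SparkContext.runJob overloads above, which in turn delegate to DAGScheduler.runJob. The application name, master URL and data are purely illustrative:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program: each action below ends up in SparkContext.runJob.
object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunJobDemo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, 4)

    // collect() is an action: it calls sc.runJob over all partitions.
    val evens = rdd.filter(_ % 2 == 0).collect()

    // runJob can also be called directly with a per-partition function
    // and an explicit subset of partitions.
    val partitionSums: Array[Int] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum, Seq(0, 1), allowLocal = false)

    println(s"even values: ${evens.length}, partition sums: ${partitionSums.mkString(", ")}")
    sc.stop()
  }
}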
Tracing further into DAGScheduler's runJob method, we find the key line that submits the job:
val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
/**
* Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
* can be used to block until the job finishes executing or can be used to cancel the job.
*/
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
allowLocal: Boolean,
resultHandler: (Int, U) => Unit,
properties: Properties = null): JobWaiter[U] =
{
// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " +
"Total number of partitions: " + maxPartitions)
}
val jobId = nextJobId.getAndIncrement()
if (partitions.size == 0) {
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
eventProcessActor ! JobSubmitted(
jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties)
waiter
}
Besides validating the requested partitions and allocating a jobId, the real work of submitJob happens at the end: eventProcessActor performs the job submission. The "!" operator means the job is submitted as an asynchronous Akka message (fire-and-forget), which is why JobSubmitted is defined as a case class, so that the receiving actor can pattern-match on it.
eventProcessActor ! JobSubmitted(
jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties)
private[scheduler] case class JobSubmitted(
jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
allowLocal: Boolean,
callSite: CallSite,
listener: JobListener,
properties: Properties = null)
extends DAGSchedulerEvent
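To make this send-and-match flow concrete, here is a minimal standalone Akka sketch (not Spark code; Greeter and Greet are made-up names). The ! operator is an asynchronous, fire-and-forget send, in contrast to the ? (ask) pattern used in initializeEventProcessActor, which waits for a reply:
import akka.actor.{Actor, ActorSystem, Props}

// The message is a case class so that the actor's receive method can pattern-match on it.
case class Greet(name: String)

class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name")   // matched just like JobSubmitted above
  }
}

object FireAndForgetDemo extends App {
  val system = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! Greet("Spark")   // asynchronous send; the caller does not block or expect a reply
  Thread.sleep(500)          // give the actor time to process before shutting down
  system.shutdown()
}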
If you recall the dagSchedulerActorSupervisor object from the beginning, you will see that job submission is actually carried out by the DAGSchedulerEventProcessActor class: its receive method pattern-matches JobSubmitted and calls dagScheduler.handleJobSubmitted, which cuts the job into stages, completing the job-to-stage conversion and producing the finalStage (every job has exactly one finalStage).
val initEventActorReply =
dagSchedulerActorSupervisor ? Props(new DAGSchedulerEventProcessActor(this))
private[scheduler] class DAGSchedulerEventProcessActor(dagScheduler: DAGScheduler)
extends Actor with Logging {
override def preStart() {
// set DAGScheduler for taskScheduler to ensure eventProcessActor is always
// valid when the messages arrive
dagScheduler.taskScheduler.setDAGScheduler(dagScheduler)
}
/**
* The main event loop of the DAG scheduler.
*/
def receive = {
case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
listener, properties)
case StageCancelled(stageId) =>
dagScheduler.handleStageCancellation(stageId)
case JobCancelled(jobId) =>
dagScheduler.handleJobCancellation(jobId)
case JobGroupCancelled(groupId) =>
dagScheduler.handleJobGroupCancelled(groupId)
case AllJobsCancelled =>
dagScheduler.doCancelAllJobs()
case ExecutorAdded(execId, host) =>
dagScheduler.handleExecutorAdded(execId, host)
case ExecutorLost(execId) =>
dagScheduler.handleExecutorLost(execId)
case BeginEvent(task, taskInfo) =>
dagScheduler.handleBeginEvent(task, taskInfo)
case GettingResultEvent(taskInfo) =>
dagScheduler.handleGetTaskResult(taskInfo)
case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
dagScheduler.handleTaskCompletion(completion)
case TaskSetFailed(taskSet, reason) =>
dagScheduler.handleTaskSetFailed(taskSet, reason)
case ResubmitFailedStages =>
dagScheduler.resubmitFailedStages()
}
override def postStop() {
// Cancel any active jobs in postStop hook
dagScheduler.cleanUpAfterSchedulerStop()
}
}
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
allowLocal: Boolean,
callSite: CallSite,
listener: JobListener,
properties: Properties = null)
{
var finalStage: Stage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
if (finalStage != null) {
val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
job.jobId, callSite.shortForm, partitions.length, allowLocal))
logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val shouldRunLocally =
localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
if (shouldRunLocally) {
// Compute very short actions like first() or take() with no parent stages locally.
listenerBus.post(SparkListenerJobStart(job.jobId, Array[Int](), properties))
runLocally(job)
} else {
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.resultOfJob = Some(job)
listenerBus.post(SparkListenerJobStart(job.jobId, jobIdToStageIds(jobId).toArray,
properties))
submitStage(finalStage)
}
}
submitWaitingStages()
}
At this point handleJobSubmitted builds the finalStage from the finalRDD via newStage. If that stage is not null, it creates an ActiveJob, clears the cached location map (clearCacheLocs), and then decides how to run the job: if local execution is enabled, allowLocal is set, the final stage has no parents and there is only one partition, the job is run locally on the driver (runLocally, useful for very short actions and for debugging); otherwise submitStage(finalStage) submits it to the cluster.
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
// Note: the RDD here is the job's final RDD, not the whole RDD chain; finalStage is built from this finalRDD
// newStage produces either a result stage or a shuffle map stage; the stage's isShuffleMap flag distinguishes the two
finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
The finalStage returned by newStage already carries all of the parent stages it depends on: getParentStages builds the stage's dependency relationships, and its visit function walks the RDD lineage backwards to construct the DAG. When it meets a narrow dependency, the parent RDD is kept in the current stage; when it meets a wide dependency, it cuts a stage boundary there and recursively creates the parent stages for the wide dependency.
/**
* Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
* of a shuffle map stage in newOrUsedStage. The stage will be associated with the provided
* jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
* directly.
*/
private def newStage(
rdd: RDD[_],
numTasks: Int,
shuffleDep: Option[ShuffleDependency[_, _, _]],
jobId: Int,
callSite: CallSite)
: Stage =
{
val id = nextStageId.getAndIncrement()
val stage =
new Stage(id, rdd, numTasks, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
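To make the narrow/wide split just described concrete, here is a hypothetical word-count job and how its lineage is cut into stages (the input path, master URL and app name are made up; in Spark versions before 1.3 the SparkContext._ import is needed for reduceByKey):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical job: reduceByKey introduces a ShuffleDependency, so the lineage is
// cut into a shuffle map stage plus a result (final) stage.
object StageSplitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageSplitDemo").setMaster("local[2]"))

    val counts = sc.textFile("input.txt")   // narrow: stays in the same stage
      .flatMap(_.split("\\s+"))             // narrow: stays in the same stage
      .map(word => (word, 1))               // narrow: stays in the same stage
      .reduceByKey(_ + _)                   // wide: ShuffleDependency => stage boundary
      .collect()                            // action: one job; its finalStage is the result stage

    println(counts.length)
    sc.stop()
  }
}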
In submitStage, getMissingParentStages(stage) derives the parent stages of the final stage, i.e. it follows the RDD dependency relationships: parent stages are generated by walking rdd.dependencies. If a dependency is a wide dependency (ShuffleDependency), a mapStage is created as the parent of the finalStage; in other words, a job that requires a shuffle is handled with a mapStage plus the finalStage. If a dependency is a narrow dependency, no new stage is created.
val missing = getMissingParentStages(stage).sortBy(_.id)
// Starting from the final stage, find all of its parent stages that still need to be computed
private def getMissingParentStages(stage: Stage): List[Stage] = {
val missing = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new Stack[RDD[_]]
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
visited += rdd
if (getCacheLocs(rdd).contains(Nil)) {
for (dep <- rdd.dependencies) {
dep match {
// For a ShuffleDependency, get (or create) the corresponding shuffle map stage; if that stage is not yet available, add it to missing
case shufDep: ShuffleDependency[_, _, _] => //ShuffleDependency
val mapStage = getShuffleMapStage(shufDep, stage.jobId)
if (!mapStage.isAvailable) {
missing += mapStage
}
case narrowDep: NarrowDependency[_] => //NarrowDependency
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
waitingForVisit.push(stage.rdd)
while (!waitingForVisit.isEmpty) {
visit(waitingForVisit.pop())
}
missing.toList
}
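The walk above deliberately maintains its own Stack instead of recursing, so that an extremely long lineage cannot overflow the JVM call stack. A generic, standalone sketch of the same idiom (not Spark code):
import scala.collection.mutable.{HashSet, Stack}

// Generic sketch of the explicit-stack traversal idiom: depth is bounded by the
// stack we manage ourselves, not by the JVM call stack.
object IterativeDfs {
  def reachable[A](start: A, neighbours: A => Seq[A]): Set[A] = {
    val visited = new HashSet[A]
    val waitingForVisit = new Stack[A]
    waitingForVisit.push(start)
    while (waitingForVisit.nonEmpty) {
      val node = waitingForVisit.pop()
      if (!visited(node)) {
        visited += node
        neighbours(node).foreach(waitingForVisit.push)
      }
    }
    visited.toSet
  }

  def main(args: Array[String]): Unit = {
    // A tiny DAG: 3 depends on 2 and 1, 2 on 1, 1 on 0.
    val parents = Map(3 -> Seq(2, 1), 2 -> Seq(1), 1 -> Seq(0), 0 -> Seq.empty[Int])
    println(reachable(3, parents).toList.sorted)   // List(0, 1, 2, 3)
  }
}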
If the current stage has no missing parents (it either has no dependencies or they have all been computed), submitMissingTasks is called to submit the stage; otherwise submitStage recursively submits each missing parent first (the parents returned by getMissingParentStages, in ascending stage-id order), so that all parent stages finish before their children, and eventually submitMissingTasks is called for every stage. As the code shows, the DAGScheduler submits work to the TaskScheduler one stage at a time, with each stage packaged as a TaskSet; from here on the work is handed over to the TaskScheduler.
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing == Nil) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id)
}
}
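Finally, a toy sketch of the parent-first order this recursion produces (not Spark code, and simplified: in Spark a stage with missing parents is parked in waitingStages and resubmitted once its parents finish, whereas here we simply recurse again):
import scala.collection.mutable.ListBuffer

// Toy model: a stage is "submitted" only after all of its parents have been submitted.
object SubmitOrderDemo {
  case class Stage(id: Int, parents: List[Stage])

  def submit(stage: Stage, submitted: ListBuffer[Int]): Unit = {
    val missing = stage.parents.filterNot(p => submitted.contains(p.id)).sortBy(_.id)
    if (missing.isEmpty) submitted += stage.id
    else {
      missing.foreach(submit(_, submitted))   // submit parents first
      submit(stage, submitted)                // then retry this stage
    }
  }

  def main(args: Array[String]): Unit = {
    val s0 = Stage(0, Nil)
    val s1 = Stage(1, List(s0))
    val s2 = Stage(2, List(s0, s1))           // s2 plays the role of the final stage
    val order = ListBuffer[Int]()
    submit(s2, order)
    println(order.mkString(" -> "))           // 0 -> 1 -> 2: parents complete before children
  }
}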