Spark source code analysis: the DAGScheduler job submission process and stage division

1.DAGScheduler.scala

Main responsibilities:
1. The DAGScheduler computes a DAG of stages for each job, keeps track of the RDDs involved and of which stage outputs are materialized (saved to disk), and finds an optimal schedule for running the job, taking into account what is cached and what is checkpointed. It then submits each stage as a TaskSet to the TaskScheduler implementation that was passed in when the DAGScheduler object was created, which sends the tasks to executors for execution.
2. Beyond building the DAG of stages, the DAGScheduler also determines the preferred locations on which to run each task, based on the current cache status, and passes these on to the TaskScheduler. When a stage fails because shuffle output files were lost, Spark resubmits the earlier stage that produced them. For failures inside a stage that are not caused by shuffle file loss, each task is retried a small number of times before the whole stage is cancelled.
Key players:
EventLoop: the event queue; submitted jobs are posted into it.
ListenerBus: the listener bus.
Job submission via runJob(): runs a job; invoked from SparkContext's own runJob method. The hypothetical driver sketch below shows how an action reaches this method.
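Before reading the source, here is a minimal, hypothetical driver program to place runJob in context (the app name and master are illustrative): every action funnels into SparkContext.runJob, which delegates to the DAGScheduler.runJob shown below.

import org.apache.spark.{SparkConf, SparkContext}

object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("runJob-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, 4)
    // count() is an action: RDD.count -> SparkContext.runJob -> DAGScheduler.runJob
    println(rdd.count())
    sc.stop()
  }
}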

def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    //submit the job; the returned JobWaiter is used to wait for its result
    val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)

    waiter.awaitResult() match {
      case JobSucceeded => { // the job ran successfully
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      }
      case JobFailed(exception: Exception) => // the job failed
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        throw exception
    }
  }

Analyzing the job-submission method: submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)

Main purpose: submits the job to the job scheduler and returns a JobWaiter. The JobWaiter can be used to block until the job finishes executing, or to cancel the job (source doc: "The JobWaiter object can be used to block until the job finishes executing or can be used to cancel the job.").

  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    /**
    * Get the number of partitions of this RDD, and make sure the job's tasks
    * are only being submitted against partitions that actually exist
    */
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) { // zero partitions to compute: finish the job right away with JobSucceeded
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]

    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)

    /**
      * Post the job to the eventProcessLoop queue. eventProcessLoop is a
      * DAGSchedulerEventProcessLoop (which extends EventLoop), created inside
      * the DAGScheduler; it runs a worker thread that keeps taking submitted
      * jobs off the queue for processing.
      */
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
    waiter
  }
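A quick hypothetical illustration of the partition check at the top of submitJob, assuming sc is an existing SparkContext: requesting a partition index outside [0, rdd.partitions.length) fails fast, before anything is posted to the event loop. The allowLocal flag matches this version's SparkContext.runJob signature.

val rdd = sc.parallelize(1 to 10, 2)  // valid partition indices: 0 and 1
// runs tasks on partitions 0 and 1 and returns their per-partition sums
sc.runJob(rdd, (it: Iterator[Int]) => it.sum, Seq(0, 1), allowLocal = false)
// requesting partition 5 would throw before any event is enqueued:
//   IllegalArgumentException: Attempting to access a non-existent partition: 5.
//   Total number of partitions: 2
// sc.runJob(rdd, (it: Iterator[Int]) => it.sum, Seq(5), allowLocal = false)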
Why the job state is JobSucceeded when partitions.size == 0 is true:
 if (partitions.size == 0) { // zero partitions to compute: finish the job right away with JobSucceeded
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }
Looking at the JobWaiter class we find that, for a job with zero tasks, jobResult is set directly to JobSucceeded:
private[spark] class JobWaiter[T](
    dagScheduler: DAGScheduler,
    val jobId: Int,
    totalTasks: Int,
    resultHandler: (Int, T) => Unit)
  extends JobListener {

  private var finishedTasks = 0

  // Is the job as a whole finished (succeeded or failed)?
  @volatile
  private var _jobFinished = totalTasks == 0

  def jobFinished = _jobFinished

  // If the job is finished, this will be its result. In the case of 0 task jobs (e.g. zero
  // partition RDDs), we set the jobResult directly to JobSucceeded.
  private var jobResult: JobResult = if (jobFinished) JobSucceeded else null
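The snippet above is abridged; the remainder of the class (a lightly abridged sketch of this version of the source) shows how those two fields drive completion: each taskSucceeded call runs the caller's resultHandler and, once all tasks have finished, flips _jobFinished and wakes up awaitResult.

  override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
    }
    resultHandler(index, result.asInstanceOf[T])  // hand the task result back to the caller
    finishedTasks += 1
    if (finishedTasks == totalTasks) {            // last task: mark the whole job finished
      _jobFinished = true
      jobResult = JobSucceeded
      this.notifyAll()                            // release awaitResult()
    }
  }

  def awaitResult(): JobResult = synchronized {
    while (!_jobFinished) {
      this.wait()
    }
    jobResult
  }

So for a zero-partition RDD (for example, sc.emptyRDD[Int].count()), totalTasks == 0, _jobFinished is true on construction, and awaitResult() returns JobSucceeded immediately without a single task being launched.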
The internally created DAGSchedulerEventProcessLoop extends EventLoop[DAGSchedulerEvent]. The main job of EventLoop:

The event loop accepts events (such as the job the DAGScheduler just posted) into its internal queue, and starts a single event thread to process all of them.

Note: this event queue grows without bound, so subclasses must make sure that onReceive handles events in time, to avoid a potential OOM.

Summary: the job is posted to eventProcessLoop; the parent class EventLoop starts a daemon thread that takes events off the eventQueue and dispatches them to the subclass's onReceive(event), which handles them. We follow handleJobSubmitted as the example; a simplified sketch of the event-loop pattern itself follows.
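A simplified, self-contained sketch of this producer/consumer pattern (not the exact Spark source; the class and method names mirror EventLoop for illustration):

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  // Single daemon worker thread: drains the queue and dispatches each event
  // to the subclass's onReceive. In DAGSchedulerEventProcessLoop, a
  // JobSubmitted event is dispatched to handleJobSubmitted.
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          onReceive(eventQueue.take())  // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // stop() was called while blocked on take()
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }

  // Called by producers such as DAGScheduler.submitJob; the queue is
  // unbounded, hence the OOM caveat above
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit
}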
Now stage division begins. Each wide (shuffle) dependency marks the boundary of a new stage, as in the example below.
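For example, in this hypothetical word count (the input path is illustrative), reduceByKey introduces a ShuffleDependency, so the job is split into two stages:

import org.apache.spark.SparkContext._  // needed for reduceByKey on pre-1.3 Spark

// Stage 0 (shuffle map stage): textFile, flatMap and map are narrow
// dependencies, so they are pipelined together into one stage
val pairs = sc.textFile("hdfs://.../input.txt").flatMap(_.split(" ")).map((_, 1))
// shuffle boundary: reduceByKey creates a ShuffleDependency
val counts = pairs.reduceByKey(_ + _)
// Stage 1 (result stage): runs once stage 0's shuffle output is available
counts.collect()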

 private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      allowLocal: Boolean,
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    var finalStage: Stage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      //1. Create the new final stage; see below for how a stage is created
      finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    if (finalStage != null) {
      //2. Create an ActiveJob so the DAGScheduler can track this active job's information
      val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
      clearCacheLocs()
      logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
        job.jobId, callSite.shortForm, partitions.length, allowLocal))
      logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
      logInfo("Parents of final stage: " + finalStage.parents)
      logInfo("Missing parents: " + getMissingParentStages(finalStage))
      val shouldRunLocally =
        localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
      val jobSubmissionTime = clock.getTimeMillis()
      if (shouldRunLocally) {
        // Compute very short actions like first() or take() with no parent stages locally.
        listenerBus.post(
          SparkListenerJobStart(job.jobId, jobSubmissionTime, Seq.empty, properties))
        runLocally(job)
      } else {
        jobIdToActiveJob(jobId) = job
        activeJobs += job
        finalStage.resultOfJob = Some(job)
        val stageIds = jobIdToStageIds(jobId).toArray
        val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))

        /**3.
          * Post an event onto the listener bus's queue. The listener bus
          * asynchronously passes events to registered listeners.
          * Until start() is called, all posted events are only buffered; only
          * after this listener bus has started will events actually be
          * propagated to all attached listeners. The bus is stopped when
          * stop() is called, and it drops further events after stopping.
          *
          * The default event queue capacity is 10000; if events are posted
          * faster than they are drained, an explicit error is reported.
          * See start(): events are taken off the queue and dispatched with
          * postToAll(event) on a single thread, which binds each event to
          * the attached listeners.
          */
        listenerBus.post(
          SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))

        submitStage(finalStage)
      }
    }
    submitWaitingStages()
  }
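A note on the shouldRunLocally branch above: in this version, when local execution is enabled (the config key is, to my understanding of this Spark 1.x era, spark.localExecution.enabled), a job that targets a single partition and whose final stage has no parents skips the scheduler entirely and is computed in the driver via runLocally(job). Hypothetical examples, assuming sc is an existing SparkContext:

// first() only needs partition 0 of an RDD with no parent stages:
// 1 partition, no parents, allowLocal = true, so it may run locally in the driver
sc.parallelize(1 to 1000, 4).first()

// collect() targets all 4 partitions, so it always goes through submitStage
sc.parallelize(1 to 1000, 4).collect()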

1. How a new stage is created

newStage creates a Stage: either directly, for use as the result (final) stage, or as part of creating a shuffle map stage through newOrUsedStage (shuffle map stages should be created with newOrUsedStage rather than by calling newStage directly; a sketch of it follows the code). The stage is tied to its job via the jobId.

private def newStage(
      rdd: RDD[_],
      numTasks: Int,
      shuffleDep: Option[ShuffleDependency[_, _, _]],
      jobId: Int,
      callSite: CallSite)
    : Stage =
  {
    //1-1 Get this stage's parent stages
    val parentStages = getParentStages(rdd, jobId)
    val id = nextStageId.getAndIncrement()
    val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
    stageIdToStage(id) = stage
    updateJobIdStageIdMaps(jobId, stage)
    stage
  }
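For comparison, here is a sketch of newOrUsedStage (lightly abridged from this version of the source), the path used for shuffle map stages: if the MapOutputTracker already knows this shuffle because an earlier job computed it, the recorded output locations are copied onto the new Stage so it need not be recomputed; otherwise the shuffle is registered.

private def newOrUsedStage(
    rdd: RDD[_],
    numTasks: Int,
    shuffleDep: ShuffleDependency[_, _, _],
    jobId: Int,
    callSite: CallSite): Stage = {
  val stage = newStage(rdd, numTasks, Some(shuffleDep), jobId, callSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // Reuse the map outputs registered by a previous run of this shuffle
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    for (i <- 0 until locs.size) {
      stage.outputLocs(i) = Option(locs(i)).toList  // locs(i) can be null
    }
    stage.numAvailableOutputs = locs.count(_ != null)
  } else {
    // First time this shuffle is seen: register it with the tracker
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.size)
  }
  stage
}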
1-1 getParentStages(rdd, jobId)

Gets (or creates) the parent stages of the given RDD and registers them with the given job. Note that a stack is maintained manually here, instead of recursing, to avoid a StackOverflowError from deeply recursive visits.

private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
    val parents = new HashSet[Stage]
    val visited = new HashSet[RDD[_]]

    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    val waitingForVisit = new Stack[RDD[_]]
    def visit(r: RDD[_]) {
      if (!visited(r)) {
        visited += r
        // Kind of ugly: need to register RDDs with the cache here since
        // we can't do it in its constructor because # of partitions is unknown
        for (dep <- r.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              parents += getShuffleMapStage(shufDep, jobId)
            case _ =>
              waitingForVisit.push(dep.rdd)
          }
        }
      }
    }
    waitingForVisit.push(rdd)
    while (!waitingForVisit.isEmpty) {
      visit(waitingForVisit.pop())
    }
    parents.toList // return the parent stages as a list
  }
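To see the two branches of the match above in action (a hypothetical snippet, assuming sc is in scope): narrow dependencies are pushed onto the stack and folded into the current stage, while a ShuffleDependency creates (or reuses) a parent shuffle map stage via getShuffleMapStage.

import org.apache.spark.ShuffleDependency
import org.apache.spark.SparkContext._  // needed for reduceByKey on pre-1.3 Spark

val mapped = sc.parallelize(1 to 10).map(x => (x % 2, x))  // narrow dependency: same stage
val reduced = mapped.reduceByKey(_ + _)                    // shuffle dependency: parent stage

reduced.dependencies.head match {
  case s: ShuffleDependency[_, _, _] => println("shuffle dependency: a new parent stage")
  case _                             => println("narrow dependency: same stage")
}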

Now look at the Stage.scala source. In brief: a stage is one unit of execution within a Spark job, a set of independent tasks all computing the same function, where all of the tasks share the same shuffle dependencies. Each DAG of tasks run by the scheduler is split into stages at the boundaries where a shuffle occurs, and the DAGScheduler then executes these stages in topological order.

Stages come in two kinds. A ShuffleMapStage produces the input for downstream stages; for it we track the shuffle map outputs of every partition on every node. A ResultStage is connected to an action and initiates the job.

Every stage is bound to a jobId, identifying the job that first submitted the stage. Under FIFO scheduling, this lets stages from earlier jobs be computed first, and recovered quickly if a computation fails.

