2. Spark Source Code Study Notes: DAGScheduler.runJob

This article takes a detailed look at how Spark's DAGScheduler runs a job, from the preparation stage to submitting the job and on to handling the JobSubmitted message. Along the way, the source code answers two questions: how the number of tasks relates to the number of partitions, and on what basis stages are divided.

    0. Preliminaries

     

    The previous chapter walked through the reduceByKey method, where we saw that a transformation ultimately just records the operation in the RDD without executing anything; the actual execution of the function is deferred until Spark reaches an action. An action calls runJob, which submits the job (as a JobSubmitted event) to the DAGScheduler's event loop to be scheduled. This chapter reads through the runJob method in detail.
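    As a quick recap, here is a minimal, self-contained sketch (the local[2] master, the app name, and the 4-partition count are illustrative assumptions, not from the article): the map below is only recorded in the RDD's lineage, and nothing is computed until reduce, an action, triggers a job.

  import org.apache.spark.{SparkConf, SparkContext}

  object LazyEvalSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))
      val nums = sc.parallelize(1 to 100, numSlices = 4)

      // Transformation: only recorded in the RDD's lineage, nothing runs yet.
      val doubled = nums.map(_ * 2)

      // Action: triggers SparkContext.runJob -> DAGScheduler.runJob, which schedules
      // one task per partition and merges the results back on the driver.
      val sum = doubled.reduce(_ + _)
      println(s"sum = $sum") // 10100

      sc.stop()
    }
  }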

    After reading through this chapter's source code, we will be able to verify two things:

    1. What exactly is the relationship between the number of tasks and the number of partitions?

    2. What exactly determines how stages are divided?

    

    1. Preparation

    
     Taking the RDD action reduce as an example, let's walk through in detail how an action submits a job. Source code first:
  /**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T = withScope {
    val cleanF = sc.clean(f)
    val reducePartition: Iterator[T] => Option[T] = iter => {       ------ 1)
      if (iter.hasNext) {
        Some(iter.reduceLeft(cleanF))
      } else {
        None
      }
    }
    var jobResult: Option[T] = None
    val mergeResult = (index: Int, taskResult: Option[T]) => {      ------ 2)
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }
    sc.runJob(this, reducePartition, mergeResult)                   ------ 3)
    // Get the final result out of our Option, or throw an exception if the RDD was empty
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }

    What reduce does will not be repeated here. The method first defines two functions. reducePartition applies cleanF iteratively to the records of iter, roughly in the style B2 = cleanF(A1, A2), B3 = cleanF(B2, A3), B4 = cleanF(B3, A4), ..., BN = cleanF(B(N-1), AN) (this pairwise combining is exactly what a reduce does). The other function, mergeResult, merges the results of the individual tasks. (None of this is the key point; just remember roughly what each function does, since both are used below.)
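    To make the division of labor between the two functions concrete, here is a small Spark-free sketch (the three hard-coded "partitions" are made up for illustration) that replays the same reducePartition / mergeResult logic on plain Scala collections:

  object ReduceSemanticsSketch {
    def main(args: Array[String]): Unit = {
      val f: (Int, Int) => Int = _ + _

      // What each task does: fold the records of its partition with reduceLeft.
      val reducePartition: Iterator[Int] => Option[Int] = iter => {
        if (iter.hasNext) Some(iter.reduceLeft(f)) else None
      }

      // Pretend the RDD has three partitions, one of them empty.
      val partitions = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5))

      // What the driver does: merge each task's result as it arrives.
      var jobResult: Option[Int] = None
      val mergeResult: (Int, Option[Int]) => Unit = (_, taskResult) => {
        if (taskResult.isDefined) {
          jobResult = jobResult match {
            case Some(value) => Some(f(value, taskResult.get))
            case None        => taskResult
          }
        }
      }

      partitions.zipWithIndex.foreach { case (p, i) => mergeResult(i, reducePartition(p.iterator)) }
      println(jobResult.getOrElse(sys.error("empty collection"))) // prints 15
    }
  }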
    Right after that, reduce calls runJob on the SparkContext. Source:
  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }
    
    SparkContext's runJob mainly does some preparatory work: it checks whether the current SparkContext has already been stopped by the user, captures the caller's information (callSite holds the user's call-stack information; a user can also set their own function to return something else), and cleans the closure that was passed in. It also does some follow-up work, such as updating the stage progress bar and triggering checkpointing. In between, it calls the dagScheduler's runJob method, which is where the job is actually submitted. (DAG here means directed acyclic graph: RDDs in Spark depend on one another, these dependencies form a DAG, and the DAGScheduler analyzes this dependency graph to decide the execution order of stages. This will be covered in detail in a later article.)
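    Since the four-argument runJob shown above is a public method, the same path can also be exercised directly from user code. Below is a minimal sketch (the app name, local[2] master, and partition count are illustrative assumptions): it runs one function per partition and lets the resultHandler collect each partition's result on the driver, which is essentially what reduce's mergeResult did above.

  import org.apache.spark.{SparkConf, SparkContext, TaskContext}

  object RunJobDirectlySketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("runJob-demo").setMaster("local[2]"))
      val rdd = sc.parallelize(1 to 10, numSlices = 4)

      // One invocation of func per partition (on the executors); resultHandler is called
      // on the driver with (partition index, result of that partition).
      val partitionSums = new Array[Int](rdd.partitions.length)
      sc.runJob(
        rdd,
        (ctx: TaskContext, iter: Iterator[Int]) => iter.sum,
        0 until rdd.partitions.length,
        (index: Int, result: Int) => partitionSums(index) = result
      )

      println(partitionSums.mkString(", "))
      sc.stop()
    }
  }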


    Next, let's look at the DAGScheduler's runJob method:
  /**
   * Run an action job on the given RDD and pass all the results to the resultHandler function as
   * they arrive.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   *   partitions of the target RDD, e.g. for operations like first()
   * @param callSite where in the user program this job was called
   * @param resultHandler callback to pass each result to
   * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
   *
   * @throws Exception when the job fails
   */
  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)                   ------1)
    // Note: Do not call Await.ready(future) because that calls `scala.concurrent.blocking`,
    // which causes concurrent SQL executions to fail if a fork-join pool is used. Note that
    // due to idiosyncrasies in Scala, `awaitPermission` is not actually used anywhere so it's
    // safe to pass in null here. For more detail, see SPARK-13747.
    val awaitPermission = null.asInstanceOf[scala.concurrent.CanAwait]                                   
    waiter.completionFuture.ready(Duration.Inf)(awaitPermission)
    waiter.completionFuture.value.get match {
      case scala.util.Success(_) =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case scala.util.Failure(exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(callerStackTrace ++ exception.getStackTrace)
        throw exception
    }
  }
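    From the caller's point of view, runJob is synchronous: submitJob returns a JobWaiter immediately, and runJob then blocks on the waiter's completionFuture until every partition's result has been passed to resultHandler, or the job fails and the exception is rethrown with the caller's stack trace attached. The sketch below is a stripped-down illustration of that pattern only (it is not Spark's JobWaiter implementation): a promise is completed by another thread when the last "task" finishes, while the calling thread blocks on the corresponding future.

  import scala.concurrent.duration.Duration
  import scala.concurrent.{Await, Future, Promise}

  object WaiterSketch {
    // Simplified stand-in for JobWaiter: counts finished tasks and completes a promise.
    final class SimpleWaiter(totalTasks: Int) {
      private val promise = Promise[Unit]()
      private var finished = 0

      def taskSucceeded(): Unit = synchronized {
        finished += 1
        if (finished == totalTasks) promise.trySuccess(())
      }

      def jobFailed(e: Exception): Unit = synchronized { promise.tryFailure(e) }

      def completionFuture: Future[Unit] = promise.future
    }

    def main(args: Array[String]): Unit = {
      val waiter = new SimpleWaiter(totalTasks = 3)

      // Pretend a scheduler thread finishes three tasks one by one.
      new Thread(() => (1 to 3).foreach { _ => Thread.sleep(100); waiter.taskSucceeded() }).start()

      // Like DAGScheduler.runJob: block until the whole job has succeeded or failed.
      Await.ready(waiter.completionFuture, Duration.Inf)
      println(s"job finished: ${waiter.completionFuture.value.get}")
    }
  }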