DAGScheduler: An Analysis of the Stage-Division Algorithm and the Task Preferred-Location Algorithm

小飞猪666

Article link: https://blog.csdn.net/yangshaojun1992/article/details/88652447

1) Analysis of sc.textFile

/**
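    * 1. Calling hadoopFile() first creates a HadoopRDD whose elements are <key, value> pairs:
    *    the key is the offset of each line in the HDFS (or other text) file and the value is the line itself.
    *    Calling map on the HadoopRDD then drops the key and keeps only the value, which yields a
    *    MapPartitionsRDD whose elements are simply the lines of text, one per element.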
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   * @param path path to the text file on a supported file system
   * @param minPartitions suggested minimum number of partitions for the resulting RDD
   * @return RDD of lines of the text file
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    // Create a HadoopRDD, then call map on it to keep only the text of each line
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }
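
The key-dropping step described in the comment above can be sketched without Spark. This is a minimal illustration in plain Scala, not Spark's implementation: `TextFileSketch`, `offsetsAndLines`, and `linesOnly` are hypothetical names, and the input is assumed to be UTF-8 text with `\n` line endings.

```scala
import java.nio.charset.StandardCharsets

object TextFileSketch {
  // Hypothetical stand-in for the records a HadoopRDD yields with TextInputFormat:
  // (byte offset of the line start, text of the line).
  def offsetsAndLines(content: String): Seq[(Long, String)] = {
    val lines = content.split("\n").toSeq
    // Running byte offset of each line start; +1 accounts for the newline byte.
    val offsets = lines.scanLeft(0L) { (offset, line) =>
      offset + line.getBytes(StandardCharsets.UTF_8).length + 1
    }
    // zip stops at the shorter sequence, so the trailing end-of-file offset is dropped.
    offsets.zip(lines)
  }

  // Mirrors `.map(pair => pair._2.toString)`: drop the key, keep only the line.
  def linesOnly(pairs: Seq[(Long, String)]): Seq[String] = pairs.map(_._2)
}
```

For the input `"a\nbb\nc"`, the pairs are `(0, "a")`, `(2, "bb")`, `(5, "c")`, and `linesOnly` returns just the three lines.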

2) Analysis of the flatMap method

/**    
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

3) Analysis of the map method


  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
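
Both flatMap and map above build a MapPartitionsRDD from a function that turns one partition's iterator into another iterator. The following plain-Scala sketch models that idea under simplifying assumptions: a dataset is just a `Seq` of partitions, and `MapPartitionsSketch` and its methods are hypothetical names rather than Spark APIs.

```scala
object MapPartitionsSketch {
  // A "dataset" modelled as a list of partitions (hypothetical stand-in for an RDD).
  type Partitions[T] = Seq[Seq[T]]

  // Analogue of MapPartitionsRDD: apply an iterator-to-iterator function to each partition.
  def mapPartitions[T, U](data: Partitions[T])(f: Iterator[T] => Iterator[U]): Partitions[U] =
    data.map(part => f(part.iterator).toList)

  // map is mapPartitions with iter.map(f), as in the source above.
  def map[T, U](data: Partitions[T])(f: T => U): Partitions[U] =
    mapPartitions(data)(iter => iter.map(f))

  // flatMap is mapPartitions with iter.flatMap(f), as in the source above.
  def flatMap[T, U](data: Partitions[T])(f: T => Iterable[U]): Partitions[U] =
    mapPartitions(data)(iter => iter.flatMap(f))
}
```

Neither operation moves data between partitions, which is why both are narrow transformations and never introduce a stage boundary.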

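4) Using the reduceByKey method

The reduceByKey method is not defined in the RDD class itself. An implicit conversion (rddToPairRDDFunctions) wraps an RDD[(K, V)] in a PairRDDFunctions object, and that is where reduceByKey lives.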

 // The following implicit functions were in SparkContext before 1.3 and users had to
  // `import SparkContext._` to enable them. Now we move them here to make the compiler find
  // them automatically. However, we still keep the old functions in SparkContext for backward
  // compatibility and forward to the following functions directly.

  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): 
    PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }

==================================================================================
reduceByKey is defined in PairRDDFunctions:
/**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }
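
The two ideas in this section, an implicit conversion that adds reduceByKey only to pair-typed data and a local merge on each partition before the final merge, can be sketched in plain Scala. This is an illustration under assumptions, not Spark's code: `PairOpsSketch`, `PairFunctions`, `toPairFunctions`, and `mergeByKey` are hypothetical names, and a `Seq` of partitions stands in for an RDD.

```scala
import scala.language.implicitConversions

object PairOpsSketch {
  // Hypothetical stand-in for PairRDDFunctions: extra methods for partitioned pair data.
  class PairFunctions[K, V](partitions: Seq[Seq[(K, V)]]) {
    def reduceByKey(func: (V, V) => V): Map[K, V] = {
      // Map-side combine: merge values per key inside each partition first.
      val combined: Seq[Map[K, V]] = partitions.map(p => mergeByKey(p, func))
      // Reduce side: merge the per-partition partial results with the same function.
      mergeByKey(combined.flatMap(_.toSeq), func)
    }
  }

  // Fold pairs into a map, combining values that share a key.
  private def mergeByKey[K, V](pairs: Seq[(K, V)], func: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(old => func(old, v)).getOrElse(v))
    }

  // Analogue of rddToPairRDDFunctions: only data whose elements are pairs gains reduceByKey.
  implicit def toPairFunctions[K, V](partitions: Seq[Seq[(K, V)]]): PairFunctions[K, V] =
    new PairFunctions(partitions)
}
```

After `import PairOpsSketch._`, calling `reduceByKey(_ + _)` on a `Seq[Seq[(String, Int)]]` compiles because the compiler inserts the conversion. The result is only well defined when the function is associative and commutative, as the Scaladoc above requires.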

5) The RDD foreach method

 // Actions (launch a job to return a value to the user program)

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

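foreach calls runJob, which in turn delegates through several overloaded runJob methods. The last runJob in that chain is the one that matters.

// It calls runJob on the dagScheduler object that was created when SparkContext was initialized.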

 /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @param resultHandler callback to pass each result to
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

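Note: this is not specific to foreach. Every other action operator also ends up calling runJob through dagScheduler. Inside DAGScheduler, runJob calls submitJob, which posts a JobSubmitted event that is handled by handleJobSubmitted, shown next.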
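
To make the point about actions concrete, here is a small plain-Scala sketch in which three actions are all expressed through a single runJob function. Everything in `RunJobSketch` is a hypothetical model: the real runJob schedules tasks on a cluster, while this one simply applies the function to each in-memory partition.

```scala
object RunJobSketch {
  // Hypothetical stand-in for runJob: apply `func` to every partition's iterator
  // and return one result per partition.
  def runJob[T, U](partitions: Seq[Seq[T]], func: Iterator[T] => U): Seq[U] =
    partitions.map(part => func(part.iterator))

  // foreach: run a side-effecting function over every element of every partition.
  def foreach[T](partitions: Seq[Seq[T]])(f: T => Unit): Unit =
    runJob(partitions, (iter: Iterator[T]) => iter.foreach(f))

  // count: each partition reports its size, and the driver sums the sizes.
  def count[T](partitions: Seq[Seq[T]]): Long =
    runJob(partitions, (iter: Iterator[T]) => iter.size.toLong).sum

  // collect: each partition returns its elements, and the driver concatenates them.
  def collect[T](partitions: Seq[Seq[T]]): Seq[T] =
    runJob(partitions, (iter: Iterator[T]) => iter.toList).flatten
}
```

The actions differ only in the per-partition function and in how the driver combines the per-partition results.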

DAGScheduler.scala

step1:DAGScheduler.scala

 /**
    * The core entry point for job scheduling in DAGScheduler
    * @param jobId
    * @param finalRDD
    * @param func
    * @param partitions
    * @param callSite
    * @param listener
    * @param properties
    */
  private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    // Step 1: create finalStage (a ResultStage object)
    var finalStage: ResultStage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      //   1) Use the last RDD, the one that triggered the job, to create finalStage (a ResultStage object)
      //    and add finalStage to DAGScheduler's in-memory cache
      finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    // Step 2: use finalStage to create a job, that is, the job whose last stage is finalStage.
    val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions".format(
      job.jobId, callSite.shortForm, partitions.length))
    logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))

    val jobSubmissionTime = clock.getTimeMillis()
    // Step 3: add the job to the in-memory cache
    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.setActiveJob(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
    // Step 4: submit finalStage with this method.
    // The call actually results in the first (earliest) stage being submitted,
    // while all the other stages are placed in the waitingStages queue.
    submitStage(finalStage)
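    /**
      * The stage-division algorithm matters a great deal. Anyone who wants to master Spark
      * must understand stage division clearly: how many jobs your Spark application is split into,
      * how many stages each job is split into,
      * and which parts of your code each stage contains.
      * Only when you know which code belongs to each stage can you,
      * in production, when some stage runs especially slowly or keeps failing,
      * trace the problem to that stage's code and debug or tune it.
      */
    /**
      * Summary of the stage-division algorithm
      * 1) Work backward from finalStage
      * 2) Start a new stage at every wide dependency
      * 3) Recurse so that parent stages are submitted first
      */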
  }
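
The first two points of the summary above (work backward from the final RDD, cut a new stage at each wide dependency) can be sketched with a toy lineage model. This is an illustrative simplification, not Spark's code: `StageSplitSketch`, `Node`, `Narrow`, and `Shuffle` are hypothetical names, and the sketch does not deduplicate RDDs that are reachable along more than one path.

```scala
object StageSplitSketch {
  // Hypothetical, minimal lineage model; these are not Spark's real classes.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep  // parent stays inside the same stage
  final case class Shuffle(parent: Node) extends Dep // wide dependency: a stage boundary
  final case class Node(name: String, deps: List[Dep] = Nil)

  // RDDs in the same stage as `last`: follow narrow dependencies only.
  def stageMembers(last: Node): List[String] = {
    val narrowParents = last.deps.collect { case Narrow(p) => p }
    last.name :: narrowParents.flatMap(p => stageMembers(p))
  }

  // The RDDs just across each shuffle boundary: each one ends a parent stage.
  def parentStageRoots(last: Node): List[Node] =
    last.deps.flatMap {
      case Shuffle(p) => List(p)
      case Narrow(p)  => parentStageRoots(p)
    }

  // Walk backward from the final RDD; stages are listed from last to earliest.
  def stages(finalRdd: Node): List[List[String]] =
    stageMembers(finalRdd) :: parentStageRoots(finalRdd).flatMap(r => stages(r))
}
```

For a word-count lineage (textFile, flatMap, map, then reduceByKey across a shuffle), `stages` yields two stages: one holding only the reduceByKey result and one holding map, flatMap, and textFile.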

step2:DAGScheduler.scala

  /**
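    * The method that submits a stage.
    * This is effectively the entry point of stage division.
    * The stage-division algorithm itself, however, is made up of submitStage() and getMissingParentStages() together.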
    */
  /** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        // Call getMissingParentStages() to get the parent stages of the current stage that are still missing
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        // This recurses repeatedly until it reaches the earliest stage, which has no parent stage; that stage is submitted first.
        // All the other stages are in waitingStages at that point.
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          // Submit the tasks of this stage
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            // Recursively call submitStage to submit the parent stage.
            // This recursion is what drives the stage-division algorithm and is its essence.
            submitStage(parent)
          }
          // Then put the current stage into waitingStages, the queue of stages waiting to run.
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }
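
The recursion in submitStage can be traced with a small model that has no Spark dependency. This is a sketch under assumptions: `SubmitStageSketch`, `Scheduler`, and especially `complete` are hypothetical, and `complete` resubmits every waiting stage, which is simpler than what a real scheduler would need to do.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  final case class Stage(id: Int, parents: List[Stage] = Nil)

  class Scheduler {
    val finished = mutable.Set.empty[Int]            // ids of completed stages
    val waiting = mutable.LinkedHashSet.empty[Stage] // stages blocked on a parent
    val running = mutable.ListBuffer.empty[Stage]    // stages whose tasks were submitted

    // Parents that have not finished yet, ordered by id as in the source above.
    def missingParents(stage: Stage): List[Stage] =
      stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)

    def submitStage(stage: Stage): Unit = {
      if (!waiting(stage) && !running.contains(stage)) {
        val missing = missingParents(stage)
        if (missing.isEmpty) {
          running += stage                        // no missing parents: submit its tasks
        } else {
          missing.foreach(p => submitStage(p))    // parents first, recursively
          waiting += stage                        // then park this stage
        }
      }
    }

    // Hypothetical helper: mark a stage as done and retry everything that was waiting.
    def complete(stage: Stage): Unit = {
      running -= stage
      finished += stage.id
      val ready = waiting.toList.sortBy(_.id)
      waiting.clear()
      ready.foreach(s => submitStage(s))
    }
  }
}
```

With a chain of three stages (0, then 1, then 2), calling `submitStage` on stage 2 leaves stage 0 running and stages 1 and 2 waiting. Completing stage 0 moves stage 1 to running while stage 2 stays waiting.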

step3:

  /** Called when stage's parents are available and we can now do its task. */
  /**
    * Submits a stage: creates a batch of tasks for it, one task per partition that needs computing
    * @param stage
    * @param jobId
    */
  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")

    // First figure out the indexes of partition ids to compute.
    // Find the partitions that still need computing; their count is the number of tasks to create
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

    // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
    // with this Stage
    val properties = jobIdToActiveJob(jobId).properties
    // Add the stage to the runningStages queue
    runningStages += stage
    // SparkListenerStageSubmitted should be posted before testing whether tasks are
    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
    // event.
    stage match {
      case s: ShuffleMapStage =>
        outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
      case s: ResultStage =>
        outputCommitCoordinator.stageStart(
          stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
    }
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
    // the serialized copy of the RDD and for each task we will deserialize it, which means each
    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
    // might modify state of objects referenced in their closures. This is necessary in Hadoop
    // where the JobConf/Configuration object is not thread-safe.
    var taskBinary: Broadcast[Array[Byte]] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      val taskBinaryBytes: Array[Byte] = stage match {
        case stage: ShuffleMapStage =>
          JavaUtils.bufferToArray(
            closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
        case stage: ResultStage =>
          JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
      }

      taskBinary = sc.broadcast(taskBinaryBytes)
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString, Some(e))
        runningStages -= stage

        // Abort execution
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }
    // Create the required number of tasks for the stage.
    // The key point here is how each task's preferred location is computed.
    val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id) // one task per partition; look up the preferred locations computed for it
            val part = stage.rdd.partitions(id)
            stage.pendingPartitions += id
            // Every stage other than finalStage is a ShuffleMapStage, so a ShuffleMapTask is created
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId)
          }
          // Otherwise the stage is finalStage (a ResultStage), for which ResultTasks are created
        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    if (tasks.size > 0) {
      logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
        s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
      // Finally, wrap the stage's tasks in a TaskSet and submit it by calling TaskScheduler's submitTasks() method
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {
      // Because we posted SparkListenerStageSubmitted earlier, we should mark
      // the stage as completed here in case there are no tasks to run
      markStageAsFinished(stage, None)

      val debugString = stage match {
        case stage: ShuffleMapStage =>
          s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})"
        case stage : ResultStage =>
          s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
      }
      logDebug(debugString)

      submitWaitingChildStages(stage)
    }
  }
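
The task-creation step above, one task per partition to compute with each task carrying its preferred locations, can be sketched as follows. The body of getPreferredLocs is not shown in this article, so the lookup rule used here (take the RDD's own location hints if it has any, otherwise fall back to a narrow-dependency parent) is an assumption made for illustration. `PreferredLocsSketch`, `Rdd`, and `Task` are hypothetical types, not Spark's.

```scala
object PreferredLocsSketch {
  // Hypothetical, simplified RDD model.
  final case class Rdd(
      name: String,
      ownLocs: Map[Int, Seq[String]] = Map.empty, // partition -> hosts this RDD reports itself
      narrowParent: Option[Rdd] = None)           // one-to-one narrow dependency, if any

  final case class Task(partition: Int, locs: Seq[String])

  // Assumed rule: use the RDD's own hints, else ask the narrow parent for the same partition.
  def preferredLocs(rdd: Rdd, partition: Int): Seq[String] = {
    val own = rdd.ownLocs.getOrElse(partition, Nil)
    if (own.nonEmpty) own
    else rdd.narrowParent.map(p => preferredLocs(p, partition)).getOrElse(Nil)
  }

  // Mirrors the shape of submitMissingTasks: compute all locations first,
  // then create one task per partition to compute.
  def createTasks(rdd: Rdd, partitionsToCompute: Seq[Int]): Seq[Task] = {
    val taskIdToLocations: Map[Int, Seq[String]] =
      partitionsToCompute.map(id => id -> preferredLocs(rdd, id)).toMap
    partitionsToCompute.map(id => Task(id, taskIdToLocations(id)))
  }
}
```

An empty location list means the task has no preference and can run on any executor.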
