Spark Shuffle Series ----- 1. The Relationship Between Spark Shuffle and Task Scheduling

     Spark divides a job into Stages based on whether the dependency between RDDs is a shuffle dependency. For clarity, the Stage that executes first is labeled Stage1 here, and the Stage that executes afterwards is labeled Stage2. A shuffle consists of two steps.

    The Map operation and the Reduce operation can be illustrated by the figure below:



     1. Map operation. The Map operation runs at the end of Stage1. It writes the data of one Stage1 partition into a shuffle file, which is saved on the local disk of the node that executes the Map operation.

      2. Reduce operation. The Reduce operation runs at the beginning of Stage2. It reads the shuffle files produced by the Map operations and generates one Stage2 partition.

    The Map and Reduce operations are connected through the ShuffledRDD class and the MapOutputTrackerMaster class.

    Through the ShuffledRDD we can find the dependency between Stage1 and Stage2; this dependency carries, across the two Stages, the ShuffledRDD's partitioning class (Partitioner), data-combining class (Aggregator), data-serialization class (Serializer), and so on.

    The MapOutputTrackerMaster class records how Stage1 produced its shuffle files; Stage2 reads the shuffle files on disk according to the descriptions kept in MapOutputTrackerMaster and produces the data of one Stage2 partition.
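To make the Stage1/Stage2 split concrete, here is a minimal sketch (not part of the original article; the local master, application name, and sample data are assumptions for illustration) that triggers a shuffle with reduceByKey and then inspects the ShuffleDependency carrying the Partitioner and Aggregator described above:

import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}

object ShuffleDependencyInspect {
  def main(args: Array[String]): Unit = {
    // Local master and application name are only for this demo.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shuffle-dep-demo"))

    // Stage1 side: the map-side pair RDD with 3 partitions.
    val pairs = sc.parallelize(1 to 100, numSlices = 3).map(i => (i % 7, 1))

    // reduceByKey introduces a ShuffleDependency; Stage2 starts at the resulting ShuffledRDD.
    val reduced = pairs.reduceByKey(new HashPartitioner(3), _ + _)

    // The ShuffleDependency carries the Partitioner and Aggregator mentioned above.
    reduced.dependencies.foreach {
      case dep: ShuffleDependency[_, _, _] =>
        println(s"shuffleId=${dep.shuffleId}, partitioner=${dep.partitioner}, " +
          s"hasAggregator=${dep.aggregator.isDefined}")
      case other =>
        println(s"narrow dependency: $other")
    }

    reduced.count()  // running an action executes the two Stages and the shuffle between them
    sc.stop()
  }
}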

    After Stage1 (the shuffle map tasks) finishes, the Spark Driver schedules and launches Stage2 following the sequence diagram below:



The sequence diagram above is particularly important for understanding how Spark schedules shuffle tasks.

When a ShuffleMapTask finishes, it returns the information about the shuffle file it produced to the TaskRunner class. TaskRunner calls CoarseGrainedExecutorBackend.statusUpdate to send this shuffle-file information to the CoarseGrainedSchedulerBackend class. After CoarseGrainedSchedulerBackend receives the message, it calls TaskSchedulerImpl.statusUpdate to handle the task status change.

TaskSchedulerImpl.statusUpdate mainly does the following two things:

1. It calls TaskSetManager.removeRunningTask to remove the successfully completed task from the TaskSetManager.runningTasksSet set.

2. It calls TaskResultGetter.enqueueSuccessfulTask to deserialize the data returned by the ShuffleMapTask and restore the returned object, which is then passed on to TaskSchedulerImpl. By default, if the serialized result is larger than 1 GB it is dropped and only the drop notification is sent to the Driver; if it is larger than roughly 10 MB minus 200 KB, the serialized result is stored in the BlockManager and only a reference is sent back; otherwise the serialized result is sent directly to the Driver. TaskResultGetter.enqueueSuccessfulTask handles each of these cases accordingly (summarized in the sketch below), and once the result is processed it calls TaskSchedulerImpl.handleSuccessfulTask.
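The following minimal sketch (not Spark source; the constants only mirror the defaults mentioned above, namely spark.driver.maxResultSize = 1g and a direct-result limit of roughly 10 MB minus 200 KB) summarizes how the three cases are distinguished by the size of the serialized result:

object ResultSizeDecision {
  sealed trait Decision
  case object DroppedTooLarge extends Decision          // result discarded; only the failure info reaches the Driver
  case object IndirectViaBlockManager extends Decision  // serialized result stored in the BlockManager, reference sent to the Driver
  case object DirectToDriver extends Decision           // serialized result sent inline to the Driver

  // Illustrative defaults taken from the text above.
  val maxResultSize: Long = 1L << 30                               // 1 GB
  val maxDirectResultSize: Long = 10L * 1024 * 1024 - 200L * 1024  // ~10 MB - 200 KB

  def decide(serializedResultSize: Long): Decision =
    if (serializedResultSize > maxResultSize) DroppedTooLarge
    else if (serializedResultSize > maxDirectResultSize) IndirectViaBlockManager
    else DirectToDriver

  def main(args: Array[String]): Unit = {
    Seq(4L * 1024, 50L * 1024 * 1024, 2L * 1024 * 1024 * 1024).foreach { size =>
      println(s"$size bytes -> ${decide(size)}")
    }
  }
}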

The relevant code is:

def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager, tid: Long, serializedData: ByteBuffer) {
    getTaskResultExecutor.execute(new Runnable {
      override def run(): Unit = Utils.logUncaughtExceptions {
        try {
          val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
              /*
               * Handle a directly returned result; in this case the result was not stored in the BlockManager.
               */
            case directResult: DirectTaskResult[_] =>
              if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
                return
              }
              // deserialize "value" without holding any lock so that it won't block other threads.
              // We should call it here, so that when it's called again in
              // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
              directResult.value()
              (directResult, serializedData.limit())
              /*
               * Handle an indirectly returned result; in this case the result was stored in the BlockManager.
               * After reading it from the BlockManager, the result block must be removed from the BlockManager.
               */
            case IndirectTaskResult(blockId, size) =>
              if (!taskSetManager.canFetchMoreResults(size)) {
                // dropped by executor if size is larger than maxResultSize
                sparkEnv.blockManager.master.removeBlock(blockId)
                return
              }
              logDebug("Fetching indirect task result for TID %s".format(tid))
              scheduler.handleTaskGettingResult(taskSetManager, tid)
              val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
              if (!serializedTaskResult.isDefined) {
                /* We won't be able to get the task result if the machine that ran the task failed
                 * between when the task ended and when we tried to fetch the result, or if the
                 * block manager had to flush the result. */
                scheduler.handleFailedTask(
                  taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
                return
              }
              val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                serializedTaskResult.get)
              sparkEnv.blockManager.master.removeBlock(blockId)
              (deserializedResult, size)
          }

          result.metrics.setResultSize(size)
          /*
           * The task result has been processed; call TaskSchedulerImpl.handleSuccessfulTask for further handling.
           */
          scheduler.handleSuccessfulTask(taskSetManager, tid, result)
        } catch {
          case cnf: ClassNotFoundException =>
            val loader = Thread.currentThread.getContextClassLoader
            taskSetManager.abort("ClassNotFound with classloader: " + loader)
          // Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
          case NonFatal(ex) =>
            logError("Exception while getting task result", ex)
            taskSetManager.abort("Exception while getting task result: %s".format(ex))
        }
      }
    })
  }

TaskSchedulerImpl.handleSuccessfulTask mainly calls TaskSetManager.handleSuccessfulTask, entering the task-scheduling logic inside the task set.

TaskSetManager.handleSuccessfulTask calls DAGScheduler.taskEnded, which sends a task-completion CompletionEvent message to DAGSchedulerEventProcessLoop; when the message is received, DAGScheduler.handleTaskCompletion is called to finish the bookkeeping for the completed task.

DAGScheduler.handleTaskCompletion first calls ShuffleMapStage.addOutputLoc to store the object deserialized by TaskResultGetter.enqueueSuccessfulTask into the ShuffleMapStage.outputLocs array; this object (a MapStatus) is the description of one shuffle file. Once all tasks of the Stage have completed, the DAGScheduler calls MapOutputTrackerMaster.registerMapOutputs to save the shuffle-file information of every Stage1 partition, taken from ShuffleMapStage.outputLocs, into MapOutputTrackerMaster. After Stage1 completes successfully, the scheduler finds all Stages in the waiting queue that no longer have missing parents, and DAGScheduler.submitMissingTasks turns each of them into tasks, puts the tasks into the pending queue, requests resources, and runs them. This includes Stage2, which starts with the shuffle reduce operation. The code is as follows:

private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
    val task = event.task
    val stageId = task.stageId
    val taskType = Utils.getFormattedClassName(task)

    outputCommitCoordinator.taskCompleted(stageId, task.partitionId,
      event.taskInfo.attempt, event.reason)

    // The success case is dealt with separately below, since we need to compute accumulator
    // updates before posting.
    if (event.reason != Success) {
      val attemptId = stageIdToStage.get(task.stageId).map(_.latestInfo.attemptId).getOrElse(-1)
      listenerBus.post(SparkListenerTaskEnd(stageId, attemptId, taskType, event.reason,
        event.taskInfo, event.taskMetrics))
    }

    if (!stageIdToStage.contains(task.stageId)) {
      // Skip all the actions if the stage has been cancelled.
      return
    }

    val stage = stageIdToStage(task.stageId)
    event.reason match {
      case Success =>
        listenerBus.post(SparkListenerTaskEnd(stageId, stage.latestInfo.attemptId, taskType,
          event.reason, event.taskInfo, event.taskMetrics))
        stage.pendingTasks -= task
        task match {
          case rt: ResultTask[_, _] =>
            // Cast to ResultStage here because it's part of the ResultTask
            // TODO Refactor this out to a function that accepts a ResultStage
            val resultStage = stage.asInstanceOf[ResultStage]
            resultStage.resultOfJob match {
              case Some(job) =>
                if (!job.finished(rt.outputId)) {
                  updateAccumulators(event)
                  job.finished(rt.outputId) = true
                  job.numFinished += 1
                  // If the whole job has finished, remove it
                  if (job.numFinished == job.numPartitions) {
                    markStageAsFinished(resultStage)
                    cleanupStateForJobAndIndependentStages(job)
                    listenerBus.post(
                      SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
                  }

                  // taskSucceeded runs some user code that might throw an exception. Make sure
                  // we are resilient against that.
                  try {
                    job.listener.taskSucceeded(rt.outputId, event.result)
                  } catch {
                    case e: Exception =>
                      // TODO: Perhaps we want to mark the resultStage as failed?
                      job.listener.jobFailed(new SparkDriverExecutionException(e))
                  }
                }
              case None =>
                logInfo("Ignoring result from " + rt + " because its job has finished")
            }

          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            updateAccumulators(event)
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
            } else {
              /*
               * Record the result of one shuffle map partition into stage.outputLocs.
               * The shuffle map phase applies the Partitioner to each key of a partition's data and shuffles
               * it into the N partitions of the reduce phase; the data is written to a single file on local disk.
               * The result returned by the shuffle map phase is the data length for each reduce partition.
               * Because the file stores the reduce partitions in order starting from partition index 0, knowing
               * each partition's length also tells us the start offset of each partition's data in the file.
               */
              shuffleStage.addOutputLoc(smt.partitionId, status)
            }
            /*
             * All tasks of this stage have finished; there are no pending tasks left.
             */
            if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) {
              markStageAsFinished(shuffleStage)
              logInfo("looking for newly runnable stages")
              logInfo("running: " + runningStages)
              logInfo("waiting: " + waitingStages)
              logInfo("failed: " + failedStages)

              // We supply true to increment the epoch number here in case this is a
              // recomputation of the map outputs. In that case, some nodes may have cached
              // locations with holes (from when we detected the error) and will need the
              // epoch incremented to refetch them.
              // TODO: Only increment the epoch number if this is not the first time
              //       we registered these map outputs.
              /*
               * Once the shuffle stage has no pending tasks left, save the results returned by all of its
               * ShuffleMapTasks into MapOutputTrackerMaster.
               */
              mapOutputTracker.registerMapOutputs(
                shuffleStage.shuffleDep.shuffleId,
                shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
                changeEpoch = true)

              clearCacheLocs()
              if (shuffleStage.outputLocs.contains(Nil)) {
                // Some tasks had failed; let's resubmit this shuffleStage
                // TODO: Lower-level scheduler should also deal with this
                logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
                  ") because some of its tasks had failed: " +
                  shuffleStage.outputLocs.zipWithIndex.filter(_._1.isEmpty)
                      .map(_._2).mkString(", "))
                submitStage(shuffleStage)
              } else {
                val newlyRunnable = new ArrayBuffer[Stage]
                for (shuffleStage <- waitingStages) {
                  logInfo("Missing parents for " + shuffleStage + ": " +
                    getMissingParentStages(shuffleStage))
                }
                /*
                 * After a Stage completes successfully, find all waiting Stages whose parents are all available,
                 * and let DAGScheduler.submitMissingTasks turn each of them into tasks, put the tasks into the
                 * pending queue, request resources, and run them. This includes the Stage that starts with the
                 * shuffle reduce operation.
                 */
                for (shuffleStage <- waitingStages if getMissingParentStages(shuffleStage).isEmpty)
                {
                  newlyRunnable += shuffleStage
                }
                waitingStages --= newlyRunnable
                runningStages ++= newlyRunnable
                for {
                  shuffleStage <- newlyRunnable.sortBy(_.id)
                  jobId <- activeJobForStage(shuffleStage)
                } {
                  logInfo("Submitting " + shuffleStage + " (" +
                    shuffleStage.rdd + "), which is now runnable")
                  submitMissingTasks(shuffleStage, jobId)
                }
              }
            }
          }

      case Resubmitted =>
        logInfo("Resubmitted " + task + ", so marking it as still running")
        stage.pendingTasks += task

      case FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) =>
        val failedStage = stageIdToStage(task.stageId)
        val mapStage = shuffleToMapStage(shuffleId)

        // It is likely that we receive multiple FetchFailed for a single stage (because we have
        // multiple tasks running concurrently on different executors). In that case, it is possible
        // the fetch failure has already been handled by the scheduler.
        if (runningStages.contains(failedStage)) {
          logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
            s"due to a fetch failure from $mapStage (${mapStage.name})")
          markStageAsFinished(failedStage, Some(failureMessage))
        }

        if (disallowStageRetryForTest) {
          abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
        } else if (failedStages.isEmpty) {
          // Don't schedule an event to resubmit failed stages if failed isn't empty, because
          // in that case the event will already have been scheduled.
          // TODO: Cancel running tasks in the stage
          logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
            s"$failedStage (${failedStage.name}) due to fetch failure")
          messageScheduler.schedule(new Runnable {
            override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
          }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
        }
        failedStages += failedStage
        failedStages += mapStage
        // Mark the map whose fetch failed as broken in the map stage
        if (mapId != -1) {
          mapStage.removeOutputLoc(mapId, bmAddress)
          mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
        }

        // TODO: mark the executor as failed only if there were lots of fetch failures on it
        if (bmAddress != null) {
          handleExecutorLost(bmAddress.executorId, fetchFailed = true, Some(task.epoch))
        }

      case commitDenied: TaskCommitDenied =>
        // Do nothing here, left up to the TaskScheduler to decide how to handle denied commits

      case ExceptionFailure(className, description, stackTrace, fullStackTrace, metrics) =>
        // Do nothing here, left up to the TaskScheduler to decide how to handle user failures

      case TaskResultLost =>
        // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

      case other =>
        // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
        // will abort the job.
    }
    submitWaitingStages()
  }

 

The shuffle map results kept in MapOutputTrackerMaster can be viewed as a two-dimensional array: the first dimension is the partition index of Stage1 (the Stage containing the Map operation), and the second dimension is the partition index of Stage2 (the Stage containing the Reduce operation).

Suppose the Stage1 Map operation has 3 partitions and the Stage2 Reduce operation also has 3 partitions; then the shuffle map results stored in MapOutputTrackerMaster can be represented by the figure below:

In that figure, the shuffle spreads the data of each Stage1 partition across the 3 Stage2 partitions; "Map2 reduce1 length" denotes the length of the data that the second Stage1 partition shuffles to the first Stage2 partition.
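Since the original figure is not reproduced here, the same layout can be sketched as a plain size matrix (the numbers are made up for illustration and are not Spark source):

// sizes(m)(r) is the number of bytes that map partition m of Stage1 shuffles to
// reduce partition r of Stage2 ("Map(m+1) reduce(r+1) length" in the text).
object MapOutputMatrixSketch {
  val sizes: Array[Array[Long]] = Array(
    Array(40L, 10L, 10L),  // Map1 -> reduce1, reduce2, reduce3
    Array(30L, 80L, 20L),  // Map2 -> reduce1, reduce2, reduce3
    Array(30L, 10L, 70L)   // Map3 -> reduce1, reduce2, reduce3
  )

  def main(args: Array[String]): Unit = {
    println(s"Map2 reduce1 length = ${sizes(1)(0)} bytes")
  }
}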

Inside DAGScheduler.submitMissingTasks, DAGScheduler.getPreferredLocs is called to compute the location of each task of the Stage.

DAGScheduler.getPreferredLocs in turn calls DAGScheduler.getPreferredLocsInternal to do the actual computation.

DAGScheduler.getPreferredLocsInternal calls MapOutputTrackerMaster.getLocationsWithLargestOutputs to determine each task's location.

The data structure representing a task location is TaskLocation; its source is:

private[spark] object TaskLocation {
  // We identify hosts on which the block is cached with this prefix.  Because this prefix contains
  // underscores, which are not legal characters in hostnames, there should be no potential for
  // confusion.  See  RFC 952 and RFC 1123 for information about the format of hostnames.
  val inMemoryLocationTag = "hdfs_cache_"

  def apply(host: String, executorId: String): TaskLocation = {
    new ExecutorCacheTaskLocation(host, executorId)
  }

  /**
   * Create a TaskLocation from a string returned by getPreferredLocations.
   * These strings have the form [hostname] or hdfs_cache_[hostname], depending on whether the
   * location is cached.
   */
  def apply(str: String): TaskLocation = {
    val hstr = str.stripPrefix(inMemoryLocationTag)
    if (hstr.equals(str)) {
      new HostTaskLocation(str)
    } else {
      new HDFSCacheTaskLocation(hstr)
    }
  }
}
This data structure identifies a task's location by the node's host address and the executor id.

The rule is: if the data that a Stage1 location contributes to a Stage2 partition accounts for more than REDUCER_PREF_LOCS_FRACTION of that partition's total data, that location becomes a TaskLocation of the corresponding Stage2 task.

Taking "Map2 reduce2 length" from the figure as an example: if (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION, then a TaskLocation is created from the host address and executor id of the node where Map2 ran in Stage1.
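A minimal sketch of this fraction check, reusing the size matrix above and assuming one hypothetical executor per map partition (the host and executor ids are invented for the example; the 0.2 default is the one quoted below, and the logic is a simplified stand-in for MapOutputTrackerMaster.getLocationsWithLargestOutputs):

object ReducePreferredLocSketch {
  // Hypothetical executor location of each Stage1 map partition.
  case class Loc(host: String, executorId: String)
  val mapLocs = Array(Loc("host1", "1"), Loc("host2", "2"), Loc("host3", "3"))

  // sizes(m)(r): bytes that map partition m shuffles to reduce partition r (same layout as above).
  val sizes: Array[Array[Long]] = Array(
    Array(40L, 10L, 10L),
    Array(30L, 80L, 20L),
    Array(30L, 10L, 70L)
  )

  val REDUCER_PREF_LOCS_FRACTION = 0.2  // default mentioned in the text

  // Keep every map location whose share of the reduce partition's data exceeds the threshold.
  def preferredLocs(reduceId: Int): Seq[Loc] = {
    val total = sizes.map(_(reduceId)).sum.toDouble
    mapLocs.indices.collect {
      case m if sizes(m)(reduceId) / total > REDUCER_PREF_LOCS_FRACTION => mapLocs(m)
    }
  }

  def main(args: Array[String]): Unit = {
    // For reduce2 (index 1): Map2 contributes 80 of 100 bytes (0.8 > 0.2), so host2 / executor 2 is preferred.
    println(preferredLocs(1))
  }
}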

From the definition of TaskLocation we can see that the TaskLocation actually created in this case is an ExecutorCacheTaskLocation.

The code is:

private def getPreferredLocsInternal(
      rdd: RDD[_],
      partition: Int,
      visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
    // If the partition has already been visited, no need to re-visit.
    // This avoids exponential path exploration.  SPARK-695
    if (!visited.add((rdd, partition))) {
      // Nil has already been returned for previously visited partitions.
      return Nil
    }
    // If the partition is cached, return the cache locations
    val cached = getCacheLocs(rdd)(partition)
    if (cached.nonEmpty) {
      return cached
    }
    // If the RDD has some placement preferences (as is the case for input RDDs), get those
    val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
    if (rddPrefs.nonEmpty) {
      return rddPrefs.map(TaskLocation(_))
    }

    rdd.dependencies.foreach {
      case n: NarrowDependency[_] =>
        // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
        // that has any placement preferences. Ideally we would choose based on transfer sizes,
        // but this will do for now.
        for (inPart <- n.getParents(partition)) {
          val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
          if (locs != Nil) {
            return locs
          }
        }
      case s: ShuffleDependency[_, _, _] =>
        // For shuffle dependencies, pick locations which have at least REDUCER_PREF_LOCS_FRACTION
        // of data as preferred locations
        if (shuffleLocalityEnabled &&
            rdd.partitions.size < SHUFFLE_PREF_REDUCE_THRESHOLD &&
            s.rdd.partitions.size < SHUFFLE_PREF_MAP_THRESHOLD) {
          // Get the preferred map output locations for this reducer
          /*
           * Use the results returned by the Stage1 shuffle map tasks to decide the locality of the
           * Stage2 shuffle reduce tasks: if the data a Stage1 node contributes to a Stage2 partition
           * exceeds REDUCER_PREF_LOCS_FRACTION (default 0.2) of that partition's total data, the node
           * becomes a preferred launch location for the Stage2 task.
           */
          val topLocsForReducer = mapOutputTracker.getLocationsWithLargestOutputs(s.shuffleId,
            partition, rdd.partitions.size, REDUCER_PREF_LOCS_FRACTION)
          if (topLocsForReducer.nonEmpty) {
            return topLocsForReducer.get.map(loc => TaskLocation(loc.host, loc.executorId))
          }
        }

      case _ =>
    }
    Nil
  }

DAGScheduler.submitMissingTasks then creates ShuffleMapTasks or ResultTasks from these TaskLocations. If the shuffled data does not meet the condition for creating an ExecutorCacheTaskLocation, the locs parameter passed when creating the ShuffleMapTask or ResultTask is Nil.

After the tasks of a Stage are created, TaskSchedulerImpl.submitTasks is called to create a TaskSetManager. While the TaskSetManager is being created, each ShuffleMapTask or ResultTask is added to different pending collections according to its task locality, as shown below:

private def addPendingTask(index: Int, readding: Boolean = false) {
    // Utility method that adds `index` to a list only if readding=false or it's not already there
    def addTo(list: ArrayBuffer[Int]) {
      if (!readding || !list.contains(index)) {
        list += index
      }
    }

    for (loc <- tasks(index).preferredLocations) {  // preferredLocations returns the host and executor id where the partition's data resides
      loc match {
        case e: ExecutorCacheTaskLocation =>
          /*
           * If the locs passed when this ShuffleMapTask or ResultTask was created are of type
           * ExecutorCacheTaskLocation, add the task index to pendingTasksForExecutor.
           * pendingTasksForExecutor is a HashMap whose key is the executor id and whose value is the
           * list of task indexes for that executor (there can be several tasks per executor).
           */
          addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
        case e: HDFSCacheTaskLocation => {
          val exe = sched.getExecutorsAliveOnHost(loc.host)
          exe match {
            case Some(set) => {
              for (e <- set) {
                addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
              }
              logInfo(s"Pending task $index has a cached location at ${e.host} " +
                ", where there are executors " + set.mkString(","))
            }
            case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
                ", but there are no executors alive there.")
          }
        }
        case _ => Unit
      }
      addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))  // in the DirectDStream case loc.host is not a node of the Spark or HDFS cluster, yet the task is still added to this HashMap
      for (rack <- sched.getRackForHost(loc.host)) {
        addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
      }
    }
    /*
     * If the locs passed when this ShuffleMapTask or ResultTask was created are Nil,
     * add the task index to pendingTasksWithNoPrefs, an ArrayBuffer whose elements are task indexes.
     */
    if (tasks(index).preferredLocations == Nil) {
      addTo(pendingTasksWithNoPrefs)
    }

    if (!readding) {
      /*
       * Every task is added to allPendingTasks.
       */
      allPendingTasks += index  // No point scanning this whole list to find the old task there; all tasks end up in this list, including tasks in the DirectDStream case
    }
  }

 
From the code above we can see that a ShuffleMapTask or ResultTask whose TaskLocation is an ExecutorCacheTaskLocation is added to pendingTasksForExecutor, pendingTasksForHost, and allPendingTasks at the same time, while a ShuffleMapTask or ResultTask whose preferred locations are Nil is added to pendingTasksWithNoPrefs and allPendingTasks.

One thing to note: although the same task is added to several pending queues, it will not be scheduled more than once; the reason is explained below when we discuss task scheduling.

After TaskSchedulerImpl.submitTasks has created the TaskSetManager, it calls CoarseGrainedSchedulerBackend.reviveOffers to request resources for running Stage2.

CoarseGrainedSchedulerBackend.reviveOffers eventually calls TaskSchedulerImpl.resourceOffers to allocate execution resources. The code is as follows:

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
    // Mark each slave as alive and remember its hostname
    // Also track if new executor is added
    var newExecAvail = false
    for (o <- offers) {
      executorIdToHost(o.executorId) = o.host
      activeExecutorIds += o.executorId
      if (!executorsByHost.contains(o.host)) {
        executorsByHost(o.host) = new HashSet[String]()
        executorAdded(o.executorId, o.host)
        newExecAvail = true
      }
      for (rack <- getRackForHost(o.host)) {
        hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
      }
    }

    // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
    val shuffledOffers = Random.shuffle(offers)
    // Build a list of tasks to assign to each worker.
    // tasks is a sequence whose elements are ArrayBuffer[TaskDescription]s; each buffer is sized to the number of cores of the corresponding executor
    val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
    // availableCpus: each element is the number of CPU cores still available on the corresponding executor
    val availableCpus = shuffledOffers.map(o => o.cores).toArray
    val sortedTaskSets = rootPool.getSortedTaskSetQueue  // get the TaskSets from rootPool, ordered by the configured scheduling algorithm
    for (taskSet <- sortedTaskSets) {
      logDebug("parentName: %s, name: %s, runningTasks: %s".format(
        taskSet.parent.name, taskSet.name, taskSet.runningTasks))
      if (newExecAvail) {
        taskSet.executorAdded()  // a new executor was added, so recompute the tasks' locality levels
      }
    }

    // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
    // of locality levels so that it gets a chance to launch local tasks on all of them.
    // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
    var launchedTask = false
    /*
     * For each task set, PROCESS_LOCAL tasks are offered first and ANY tasks last.
     */
    for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }

    if (tasks.size > 0) {
      hasLaunchedTask = true
    }
    return tasks
  }

When allocating resources to the tasks of a task set, the locality priority is: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY.

TaskSchedulerImpl.resourceOffers ultimately calls TaskSchedulerImpl.resourceOfferSingleTaskSet to allocate resources for one task set. resourceOfferSingleTaskSet loops over the executor offers, takes runnable task ids from the pending queues, and assigns them to the executor currently being offered:

 private def resourceOfferSingleTaskSet(
      taskSet: TaskSetManager,
      maxLocality: TaskLocality,
      shuffledOffers: Seq[WorkerOffer],
      availableCpus: Array[Int],
      tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
    var launchedTask = false
    for (i <- 0 until shuffledOffers.size) {
      val execId = shuffledOffers(i).executorId
      val host = shuffledOffers(i).host
      if (availableCpus(i) >= CPUS_PER_TASK) {  // assign tasks according to the number of available CPU cores
        try {
          for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
            tasks(i) += task  // place this task on the i-th worker (the worker order has already been shuffled)
            val tid = task.taskId
            taskIdToTaskSetId(tid) = taskSet.taskSet.id  // record which task set this task belongs to
            taskIdToExecutorId(tid) = execId  // record which executor this task runs on
            executorsByHost(host) += execId
            availableCpus(i) -= CPUS_PER_TASK
            assert(availableCpus(i) >= 0)
            launchedTask = true
          }
        } catch {
          case e: TaskNotSerializableException =>
            logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
            // Do not offer resources for this task, but don't throw an error to allow other
            // task sets to be submitted.
            return launchedTask
        }
      }
    }
    return launchedTask
  }
TaskSchedulerImpl.resourceOfferSingleTaskSet calls TaskSetManager.resourceOffer to allocate resources to a single task in the task set, and TaskSetManager.resourceOffer calls TaskSetManager.dequeueTask to do the actual selection. The code is as follows:

 private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
    : Option[(Int, TaskLocality.Value, Boolean)] =
  {
    /*
     * If the locs passed when the ShuffleMapTask or ResultTask was created are of type
     * ExecutorCacheTaskLocation, the task index is taken from the pendingTasksForExecutor HashMap.
     */
    for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {  // for a KafkaRDD the partition's host differs from every executor's host, so such tasks cannot be taken from this HashMap
      for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
        return Some((index, TaskLocality.NODE_LOCAL, false))
      }
    }
    /*
     * If the locs passed when the ShuffleMapTask or ResultTask was created are Nil,
     * the task index is taken from pendingTasksWithNoPrefs.
     */
    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
      // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
      for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
        return Some((index, TaskLocality.PROCESS_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
      for {
        rack <- sched.getRackForHost(host)
        index <- dequeueTaskFromList(execId, getPendingTasksForRack(rack))
      } {
        return Some((index, TaskLocality.RACK_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {  // tasks processing a KafkaRDD are taken from the allPendingTasks list here
      for (index <- dequeueTaskFromList(execId, allPendingTasks)) {
        return Some((index, TaskLocality.ANY, false))
      }
    }

    // find a speculative task if all others tasks have been scheduled
    dequeueSpeculativeTask(execId, host, maxLocality).map {
      case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
  }

TaskSetManager.dequeueTask calls TaskSetManager.dequeueTaskFromList to take a task from a pending list:

private def dequeueTaskFromList(execId: String, list: ArrayBuffer[Int]): Option[Int] = {
    var indexOffset = list.size
    while (indexOffset > 0) {
      indexOffset -= 1
      val index = list(indexOffset)
      if (!executorIsBlacklisted(execId, index)) {
        // This should almost always be list.trimEnd(1) to remove tail
        list.remove(indexOffset)
        /*
         * If the task has already started running, copiesRunning(index) == 1;
         * if it has already completed successfully, successful(index) == true.
         * This check prevents the same task from being launched twice.
         */
        if (copiesRunning(index) == 0 && !successful(index)) {
          return Some(index)
        }
      }
    }
    None
  }


For a task with very strong locality, such as one with an ExecutorCacheTaskLocation, the task index is added to three data structures: pendingTasksForExecutor, pendingTasksForHost, and allPendingTasks. Once the task has started running, copiesRunning(index) == 1, and once it has completed successfully, successful(index) == true; the check copiesRunning(index) == 0 && !successful(index) therefore prevents the task from being executed more than once.
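A tiny self-contained model (not Spark source; the names simply mirror the fields discussed above) shows why the duplicate entries in several pending lists are harmless:

import scala.collection.mutable.ArrayBuffer

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val copiesRunning = Array.fill(1)(0)
    val successful = Array.fill(1)(false)

    // Task index 0 sits in both an executor-local list and the global list,
    // mimicking pendingTasksForExecutor and allPendingTasks.
    val pendingForExecutor = ArrayBuffer(0)
    val allPending = ArrayBuffer(0)

    // Simplified dequeueTaskFromList: pop from the tail, skip indexes already running or finished.
    def dequeue(list: ArrayBuffer[Int]): Option[Int] = {
      var i = list.size
      while (i > 0) {
        i -= 1
        val index = list.remove(i)
        if (copiesRunning(index) == 0 && !successful(index)) return Some(index)
      }
      None
    }

    val first = dequeue(pendingForExecutor)  // Some(0): the task is launched from the executor-local list
    first.foreach(idx => copiesRunning(idx) += 1)
    val second = dequeue(allPending)         // None: the same index is skipped because it is already running
    println(s"first=$first, second=$second")
  }
}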




From the analysis above we can draw the following conclusion.

Which executor a Stage2 task runs on falls into two cases:

Case 1: when the Stage2 task is created, the locs passed in are of type ExecutorCacheTaskLocation, i.e. (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION. Such a task runs on the executor where Map2 ran.

Case 2: when the Stage2 task is created, the locs passed in are Nil, i.e. no map location's share exceeds the threshold (for example (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) < REDUCER_PREF_LOCS_FRACTION). Such a task is handed to whichever shuffled executor offer dequeues it, which is effectively an arbitrary executor.



