Spark Shuffle Series ----- 1. The Relationship Between Spark Shuffle and Task Scheduling

     Spark divides a job into Stages based on whether the dependency between RDDs is a shuffle dependency. For clarity, the Stage that executes first is labeled Stage1 here, and the Stage that executes afterwards is labeled Stage2. A shuffle consists of two steps.

    The Map operation and the Reduce operation can be illustrated by the figure below:



     1. Map operation. The Map operation runs at the end of Stage1. It writes the data of one Stage1 partition into a shuffle file, which is saved on the local disk of the node that executes the Map operation.

      2. Reduce operation. The Reduce operation runs at the beginning of Stage2. It reads the shuffle files produced by the Map operations and generates one Stage2 partition.

    The Map and Reduce operations are connected through the ShuffledRDD class and the MapOutputTrackerMaster class.

    Through the ShuffledRDD we can find the dependency between Stage1 and Stage2; this dependency carries, across the two Stages, the ShuffledRDD's partitioning class (Partitioner), data-combining class (Aggregator), data-serialization class (Serializer), and so on.

    The MapOutputTrackerMaster class records how Stage1 produced its shuffle files; Stage2 reads the shuffle files on disk according to the descriptions kept in MapOutputTrackerMaster and produces the data of one Stage2 partition.
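To make the Stage1/Stage2 split concrete, here is a minimal sketch (not part of the original article; the local master, application name, and sample data are assumptions for illustration) that triggers a shuffle with reduceByKey and then inspects the ShuffleDependency carrying the Partitioner and Aggregator described above:

import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}

object ShuffleDependencyInspect {
  def main(args: Array[String]): Unit = {
    // Local master and application name are only for this demo.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shuffle-dep-demo"))

    // Stage1 side: the map-side pair RDD with 3 partitions.
    val pairs = sc.parallelize(1 to 100, numSlices = 3).map(i => (i % 7, 1))

    // reduceByKey introduces a ShuffleDependency; Stage2 starts at the resulting ShuffledRDD.
    val reduced = pairs.reduceByKey(new HashPartitioner(3), _ + _)

    // The ShuffleDependency carries the Partitioner and Aggregator mentioned above.
    reduced.dependencies.foreach {
      case dep: ShuffleDependency[_, _, _] =>
        println(s"shuffleId=${dep.shuffleId}, partitioner=${dep.partitioner}, " +
          s"hasAggregator=${dep.aggregator.isDefined}")
      case other =>
        println(s"narrow dependency: $other")
    }

    reduced.count()  // running an action executes the two Stages and the shuffle between them
    sc.stop()
  }
}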

    After Stage1 (the shuffle map tasks) finishes, the Spark Driver schedules and launches Stage2 following the sequence diagram below:



The sequence diagram above is particularly important for understanding how Spark schedules shuffle tasks.

When a ShuffleMapTask finishes, it returns the information about the shuffle file it produced to the TaskRunner class. TaskRunner calls CoarseGrainedExecutorBackend.statusUpdate to send this shuffle-file information to the CoarseGrainedSchedulerBackend class. After CoarseGrainedSchedulerBackend receives the message, it calls TaskSchedulerImpl.statusUpdate to handle the task status change.

TaskSchedulerImpl.statusUpdate mainly does the following two things:

1. It calls TaskSetManager.removeRunningTask to remove the successfully completed task from the TaskSetManager.runningTasksSet set.

2. It calls TaskResultGetter.enqueueSuccessfulTask to deserialize the data returned by the ShuffleMapTask and restore the returned object, which is then passed on to TaskSchedulerImpl. By default, if the serialized result is larger than 1 GB it is dropped and only the drop notification is sent to the Driver; if it is larger than roughly 10 MB minus 200 KB, the serialized result is stored in the BlockManager and only a reference is sent back; otherwise the serialized result is sent directly to the Driver. TaskResultGetter.enqueueSuccessfulTask handles each of these cases accordingly (summarized in the sketch below), and once the result is processed it calls TaskSchedulerImpl.handleSuccessfulTask.
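The following minimal sketch (not Spark source; the constants only mirror the defaults mentioned above, namely spark.driver.maxResultSize = 1g and a direct-result limit of roughly 10 MB minus 200 KB) summarizes how the three cases are distinguished by the size of the serialized result:

object ResultSizeDecision {
  sealed trait Decision
  case object DroppedTooLarge extends Decision          // result discarded; only the failure info reaches the Driver
  case object IndirectViaBlockManager extends Decision  // serialized result stored in the BlockManager, reference sent to the Driver
  case object DirectToDriver extends Decision           // serialized result sent inline to the Driver

  // Illustrative defaults taken from the text above.
  val maxResultSize: Long = 1L << 30                               // 1 GB
  val maxDirectResultSize: Long = 10L * 1024 * 1024 - 200L * 1024  // ~10 MB - 200 KB

  def decide(serializedResultSize: Long): Decision =
    if (serializedResultSize > maxResultSize) DroppedTooLarge
    else if (serializedResultSize > maxDirectResultSize) IndirectViaBlockManager
    else DirectToDriver

  def main(args: Array[String]): Unit = {
    Seq(4L * 1024, 50L * 1024 * 1024, 2L * 1024 * 1024 * 1024).foreach { size =>
      println(s"$size bytes -> ${decide(size)}")
    }
  }
}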

The relevant code is:

def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager, tid: Long, serializedData: ByteBuffer) {
    getTaskResultExecutor.execute(new Runnable {
      override def run(): Unit = Utils.logUncaughtExceptions {
        try {
          val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
              /*
               * Handle a directly returned result; in this case the result was not stored in the BlockManager.
               */
            case directResult: DirectTaskResult[_] =>
              if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
                return
              }
              // deserialize "value" without holding any lock so that it won't block other threads.
              // We should call it here, so that when it's called again in
              // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
              directResult.value()
              (directResult, serializedData.limit())
              /*
               * Handle an indirectly returned result; in this case the result was stored in the BlockManager.
               * After reading it from the BlockManager, the result block must be removed from the BlockManager.
               */
            case IndirectTaskResult(blockId, size) =>
              if (!taskSetManager.canFetchMoreResults(size)) {
                // dropped by executor if size is larger than maxResultSize
                sparkEnv.blockManager.master.removeBlock(blockId)
                return
              }
              logDebug("Fetching indirect task result for TID %s".format(tid))
              scheduler.handleTaskGettingResult(taskSetManager, tid)
              val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
              if (!serializedTaskResult.isDefined) {
                /* We won't be able to get the task result if the machine that ran the task failed
                 * between when the task ended and when we tried to fetch the result, or if the
                 * block manager had to flush the result. */
                scheduler.handleFailedTask(
                  taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
                return
              }
              val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                serializedTaskResult.get)
              sparkEnv.blockManager.master.removeBlock(blockId)
              (deserializedResult, size)
          }

          result.metrics.setResultSize(size)
          /*
           * The task result has been processed; call TaskSchedulerImpl.handleSuccessfulTask for further handling.
           */
          scheduler.handleSuccessfulTask(taskSetManager, tid, result)
        } catch {
          case cnf: ClassNotFoundException =>
            val loader = Thread.currentThread.getContextClassLoader
            taskSetManager.abort("ClassNotFound with classloader: " + loader)
          // Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
          case NonFatal(ex) =>
            logError("Exception while getting task result", ex)
            taskSetManager.abort("Exception while getting task result: %s".format(ex))
        }
      }
    })
  }

TaskSchedulerImpl.handleSuccessfulTask mainly calls TaskSetManager.handleSuccessfulTask, entering the task-scheduling logic inside the task set.

TaskSetManager.handleSuccessfulTask calls DAGScheduler.taskEnded, which sends a task-completion CompletionEvent message to DAGSchedulerEventProcessLoop; when the message is received, DAGScheduler.handleTaskCompletion is called to finish the bookkeeping for the completed task.

DAGScheduler.handleTaskCompletion first calls ShuffleMapStage.addOutputLoc to store the object deserialized by TaskResultGetter.enqueueSuccessfulTask into the ShuffleMapStage.outputLocs array; this object (a MapStatus) is the description of one shuffle file. Once all tasks of the Stage have completed, the DAGScheduler calls MapOutputTrackerMaster.registerMapOutputs to save the shuffle-file information of every Stage1 partition, taken from ShuffleMapStage.outputLocs, into MapOutputTrackerMaster. After Stage1 completes successfully, the scheduler finds all Stages in the waiting queue that no longer have missing parents, and DAGScheduler.submitMissingTasks turns each of them into tasks, puts the tasks into the pending queue, requests resources, and runs them. This includes Stage2, which starts with the shuffle reduce operation. The code is as follows:

private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
    val task = event.task
    val stageId = task.stageId
    val taskType = Utils.getFormattedClassName(task)

    outputCommitCoordinator.taskCompleted(stageId, task.partitionId,
      event.taskInfo.attempt, event.reason)

    // The success case is dealt with separately below, since we need to compute accumulator
    // updates before posting.
    if (event.reason != Success) {
      val attemptId = stageIdToStage.get(task.stageId).map(_.latestInfo.attemptId).getOrElse(-1)
      listenerBus.post(SparkListenerTaskEnd(stageId, attemptId, taskType, event.reason,
        event.taskInfo, event.taskMetrics))
    }

    if (!stageIdToStage.contains(task.stageId)) {
      // Skip all the actions if the stage has been cancelled.
      return
    }

    val stage = stageIdToStage(task.stageId)
    event.reason match {
      case Success =>
        listenerBus.post(SparkListenerTaskEnd(stageId, stage.latestInfo.attemptId, taskType,
          event.reason, event.taskInfo, event.taskMetrics))
        stage.pendingTasks -= task
        task match {
          case rt: ResultTask[_, _] =>
            // Cast to ResultStage here because it's part of the ResultTask
            // TODO Refactor this out to a function that accepts a ResultStage
            val resultStage = stage.asInstanceOf[ResultStage]
            resultStage.resultOfJob match {
              case Some(job) =>
                if (!job.finished(rt.outputId)) {
                  updateAccumulators(event)
                  job.finished(rt.outputId) = true
                  job.numFinished += 1
                  // If the whole job has finished, remove it
                  if (job.numFinished == job.numPartitions) {
                    markStageAsFinished(resultStage)
                    cleanupStateForJobAndIndependentStages(job)
                    listenerBus.post(
                      SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
                  }

                  // taskSucceeded runs some user code that might throw an exception. Make sure
                  // we are resilient against that.
                  try {
                    job.listener.taskSucceeded(rt.outputId, event.result)
                  } catch {
                    case e: Exception =>
                      // TODO: Perhaps we want to mark the resultStage as failed?
                      job.listener.jobFailed(new SparkDriverExecutionException(e))
                  }
                }
              case None =>
                logInfo("Ignoring result from " + rt + " because its job has finished")
            }

          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            updateAccumulators(event)
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
            } else {
              /*
               * Record the result of one shuffle map partition into stage.outputLocs.
               * The shuffle map phase applies the Partitioner to each key of a partition's data and shuffles
               * it into the N partitions of the reduce phase; the data is written to a single file on local disk.
               * The result returned by the shuffle map phase is the data length for each reduce partition.
               * Because the file stores the reduce partitions in order starting from partition index 0, knowing
               * each partition's length also tells us the start offset of each partition's data in the file.
               */
              shuffleStage.addOutputLoc(smt.partitionId, status)
            }
            /*
             * All tasks of this stage have finished; there are no pending tasks left.
             */
            if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) {
              markStageAsFinished(shuffleStage)
              logInfo("looking for newly runnable stages")
              logInfo("running: " + runningStages)
              logInfo("waiting: " + waitingStages)
              logInfo("failed: " + failedStages)

              // We supply true to increment the epoch number here in case this is a
              // recomputation of the map outputs. In that case, some nodes may have cached
              // locations with holes (from when we detected the error) and will need the
              // epoch incremented to refetch them.
              // TODO: Only increment the epoch number if this is not the first time
              //       we registered these map outputs.
              /*
               * Once the shuffle stage has no pending tasks left, save the results returned by all of its
               * ShuffleMapTasks into MapOutputTrackerMaster.
               */
              mapOutputTracker.registerMapOutputs(
                shuffleStage.shuffleDep.shuffleId,
                shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
                changeEpoch = true)

              clearCacheLocs()
              if (shuffleStage.outputLocs.contains(Nil)) {
                // Some tasks had failed; let's resubmit this shuffleStage
                // TODO: Lower-level scheduler should also deal with this
                logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
                  ") because some of its tasks had failed: " +
                  shuffleStage.outputLocs.zipWithIndex.filter(_._1.isEmpty)
                      .map(_._2).mkString(", "))
                submitStage(shuffleStage)
              } else {
                val newlyRunnable = new ArrayBuffer[Stage]
                for (shuffleStage <- waitingStages) {
                  logInfo("Missing parents for " + shuffleStage + ": " +
                    getMissingParentStages(shuffleStage))
                }
                /*
                 * After a Stage completes successfully, find all waiting Stages whose parents are all available,
                 * and let DAGScheduler.submitMissingTasks turn each of them into tasks, put the tasks into the
                 * pending queue, request resources, and run them. This includes the Stage that starts with the
                 * shuffle reduce operation.
                 */
                for (shuffleStage <- waitingStages if getMissingParentStages(shuffleStage).isEmpty)
                {
                  newlyRunnable += shuffleStage
                }
                waitingStages --= newlyRunnable
                runningStages ++= newlyRunnable
                for {
                  shuffleStage <- newlyRunnable.sortBy(_.id)
                  jobId <- activeJobForStage(shuffleStage)
                } {
                  logInfo("Submitting " + shuffleStage + " (" +
                    shuffleStage.rdd + "), which is now runnable")
                  submitMissingTasks(shuffleStage, jobId)
                }
              }
            }
          }

      case Resubmitted =>
        logInfo("Resubmitted " + task + ", so marking it as still running")
        stage.pendingTasks += task

      case FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) =>
        val failedStage = stageIdToStage(task.stageId)
        val mapStage = shuffleToMapStage(shuffleId)

        // It is likely that we receive multiple FetchFailed for a single stage (because we have
        // multiple tasks running concurrently on different executors). In that case, it is possible
        // the fetch failure has already been handled by the scheduler.
        if (runningStages.contains(failedStage)) {
          logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
            s"due to a fetch failure from $mapStage (${mapStage.name})")
          markStageAsFinished(failedStage, Some(failureMessage))
        }

        if (disallowStageRetryForTest) {
          abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
        } else if (failedStages.isEmpty) {
          // Don't schedule an event to resubmit failed stages if failed isn't empty, because
          // in that case the event will already have been scheduled.
          // TODO: Cancel running tasks in the stage
          logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
            s"$failedStage (${failedStage.name}) due to fetch failure")
          messageScheduler.schedule(new Runnable {
            override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
          }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
        }
        failedStages += failedStage
        failedStages += mapStage
        // Mark the map whose fetch failed as broken in the map stage
        if (mapId != -1) {
          mapStage.removeOutputLoc(mapId, bmAddress)
          mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
        }

        // TODO: mark the executor as failed only if there were lots of fetch failures on it
        if (bmAddress != null) {
          handleExecutorLost(bmAddress.executorId, fetchFailed = true, Some(task.epoch))
        }

      case commitDenied: TaskCommitDenied =>
        // Do nothing here, left up to the TaskScheduler to decide how to handle denied commits

      case ExceptionFailure(className, description, stackTrace, fullStackTrace, metrics) =>
        // Do nothing here, left up to the TaskScheduler to decide how to handle user failures

      case TaskResultLost =>
        // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

      case other =>
        // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
        // will abort the job.
    }
    submitWaitingStages()
  }

 

The shuffle map results kept in MapOutputTrackerMaster can be viewed as a two-dimensional array: the first dimension is the partition index of Stage1 (the Stage containing the Map operation), and the second dimension is the partition index of Stage2 (the Stage containing the Reduce operation).

Suppose the Stage1 Map operation has 3 partitions and the Stage2 Reduce operation also has 3 partitions; then the shuffle map results stored in MapOutputTrackerMaster can be represented by the figure below:

In that figure, the shuffle spreads the data of each Stage1 partition across the 3 Stage2 partitions; "Map2 reduce1 length" denotes the length of the data that the second Stage1 partition shuffles to the first Stage2 partition.
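Since the original figure is not reproduced here, the same layout can be sketched as a plain size matrix (the numbers are made up for illustration and are not Spark source):

// sizes(m)(r) is the number of bytes that map partition m of Stage1 shuffles to
// reduce partition r of Stage2 ("Map(m+1) reduce(r+1) length" in the text).
object MapOutputMatrixSketch {
  val sizes: Array[Array[Long]] = Array(
    Array(40L, 10L, 10L),  // Map1 -> reduce1, reduce2, reduce3
    Array(30L, 80L, 20L),  // Map2 -> reduce1, reduce2, reduce3
    Array(30L, 10L, 70L)   // Map3 -> reduce1, reduce2, reduce3
  )

  def main(args: Array[String]): Unit = {
    println(s"Map2 reduce1 length = ${sizes(1)(0)} bytes")
  }
}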

Inside DAGScheduler.submitMissingTasks, DAGScheduler.getPreferredLocs is called to compute the location of each task of the Stage.

DAGScheduler.getPreferredLocs in turn calls DAGScheduler.getPreferredLocsInternal to do the actual computation.

DAGScheduler.getPreferredLocsInternal calls MapOutputTrackerMaster.getLocationsWithLargestOutputs to determine each task's location.

The data structure representing a task location is TaskLocation; its source is:

private[spark] object TaskLocation {
  // We identify hosts on which the block is cached with this prefix.  Because this prefix contains
  // underscores, which are not legal characters in hostnames, there should be no potential for
  // confusion.  See  RFC 952 and RFC 1123 for information about the format of hostnames.
  val inMemoryLocationTag = "hdfs_cache_"

  def apply(host: String, executorId: String): TaskLocation = {
    new ExecutorCacheTaskLocation(host, executorId)
  }

  /**
   * Create a TaskLocation from a string returned by getPreferredLocations.
   * These strings have the form [hostname] or hdfs_cache_[hostname], depending on whether the
   * location is cached.
   */
  def apply(str: String): TaskLocation = {
    val hstr = str.stripPrefix(inMemoryLocationTag)
    if (hstr.equals(str)) {
      new HostTaskLocation(str)
    } else {
      new HDFSCacheTaskLocation(hstr)
    }
  }
}
This data structure identifies a task's location by the node's host address and the executor id.

The rule is: if the data that a Stage1 location contributes to a Stage2 partition accounts for more than REDUCER_PREF_LOCS_FRACTION of that partition's total data, that location becomes a TaskLocation of the corresponding Stage2 task.

Taking "Map2 reduce2 length" from the figure as an example: if (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION, then a TaskLocation is created from the host address and executor id of the node where Map2 ran in Stage1.
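A minimal sketch of this fraction check, reusing the size matrix above and assuming one hypothetical executor per map partition (the host and executor ids are invented for the example; the 0.2 default is the one quoted below, and the logic is a simplified stand-in for MapOutputTrackerMaster.getLocationsWithLargestOutputs):

object ReducePreferredLocSketch {
  // Hypothetical executor location of each Stage1 map partition.
  case class Loc(host: String, executorId: String)
  val mapLocs = Array(Loc("host1", "1"), Loc("host2", "2"), Loc("host3", "3"))

  // sizes(m)(r): bytes that map partition m shuffles to reduce partition r (same layout as above).
  val sizes: Array[Array[Long]] = Array(
    Array(40L, 10L, 10L),
    Array(30L, 80L, 20L),
    Array(30L, 10L, 70L)
  )

  val REDUCER_PREF_LOCS_FRACTION = 0.2  // default mentioned in the text

  // Keep every map location whose share of the reduce partition's data exceeds the threshold.
  def preferredLocs(reduceId: Int): Seq[Loc] = {
    val total = sizes.map(_(reduceId)).sum.toDouble
    mapLocs.indices.collect {
      case m if sizes(m)(reduceId) / total > REDUCER_PREF_LOCS_FRACTION => mapLocs(m)
    }
  }

  def main(args: Array[String]): Unit = {
    // For reduce2 (index 1): Map2 contributes 80 of 100 bytes (0.8 > 0.2), so host2 / executor 2 is preferred.
    println(preferredLocs(1))
  }
}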

From the definition of TaskLocation we can see that the TaskLocation actually created in this case is an ExecutorCacheTaskLocation.

The code is:

private def getPreferredLocsInternal(
      rdd: RDD[_],
      partition: Int,
      visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
    // If the partition has already been visited, no need to re-visit.
    // This avoids exponential path exploration.  SPARK-695
    if (!visited.add((rdd, partition))) {
      // Nil has already been returned for previously visited partitions.
      return Nil
    }
    // If the partition is cached, return the cache locations
    val cached = getCacheLocs(rdd)(partition)
    if (cached.nonEmpty) {
      return cached
    }
    // If the RDD has some placement preferences (as is the case for input RDDs), get those
    val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
    if (rddPrefs.nonEmpty) {
      return rddPrefs.map(TaskLocation(_))
    }

    rdd.dependencies.foreach {
      case n: NarrowDependency[_] =>
        // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
        // that has any placement preferences. Ideally we would choose based on transfer sizes,
        // but this will do for now.
        for (inPart <- n.getParents(partition)) {
          val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
          if (locs != Nil) {
            return locs
          }
        }
      case s: ShuffleDependency[_, _, _] =>
        // For shuffle dependencies, pick locations which have at least REDUCER_PREF_LOCS_FRACTION
        // of data as preferred locations
        if (shuffleLocalityEnabled &&
            rdd.partitions.size < SHUFFLE_PREF_REDUCE_THRESHOLD &&
            s.rdd.partitions.size < SHUFFLE_PREF_MAP_THRESHOLD) {
          // Get the preferred map output locations for this reducer
          /*
           * Use the results returned by the Stage1 shuffle map tasks to decide the locality of the
           * Stage2 shuffle reduce tasks: if the data a Stage1 node contributes to a Stage2 partition
           * exceeds REDUCER_PREF_LOCS_FRACTION (default 0.2) of that partition's total data, the node
           * becomes a preferred launch location for the Stage2 task.
           */
          val topLocsForReducer = mapOutputTracker.getLocationsWithLargestOutputs(s.shuffleId,
            partition, rdd.partitions.size, REDUCER_PREF_LOCS_FRACTION)
          if (topLocsForReducer.nonEmpty) {
            return topLocsForReducer.get.map(loc => TaskLocation(loc.host, loc.executorId))
          }
        }

      case _ =>
    }
    Nil
  }

DAGScheduler.submitMissingTasks then creates ShuffleMapTasks or ResultTasks from these TaskLocations. If the shuffled data does not meet the condition for creating an ExecutorCacheTaskLocation, the locs parameter passed when creating the ShuffleMapTask or ResultTask is Nil.

After the tasks of a Stage are created, TaskSchedulerImpl.submitTasks is called to create a TaskSetManager. While the TaskSetManager is being created, each ShuffleMapTask or ResultTask is added to different pending collections according to its task locality, as shown below:

private def addPendingTask(index: Int, readding: Boolean = false) {
    // Utility method that adds `index` to a list only if readding=false or it's not already there
    def addTo(list: ArrayBuffer[Int]) {
      if (!readding || !list.contains(index)) {
        list += index
      }
    }

    for (loc <- tasks(index).preferredLocations) {  // preferredLocations returns the host and executor id where the partition's data resides
      loc match {
        case e: ExecutorCacheTaskLocation =>
          /*
           * If the locs passed when this ShuffleMapTask or ResultTask was created are of type
           * ExecutorCacheTaskLocation, add the task index to pendingTasksForExecutor.
           * pendingTasksForExecutor is a HashMap whose key is the executor id and whose value is the
           * list of task indexes for that executor (there can be several tasks per executor).
           */
          addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
        case e: HDFSCacheTaskLocation => {
          val exe = sched.getExecutorsAliveOnHost(loc.host)
          exe match {
            case Some(set) => {
              for (e <- set) {
                addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
              }
              logInfo(s"Pending task $index has a cached location at ${e.host} " +
                ", where there are executors " + set.mkString(","))
            }
            case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
                ", but there are no executors alive there.")
          }
        }
        case _ => Unit
      }
      addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))  // in the DirectDStream case loc.host is not a node of the Spark or HDFS cluster, yet the task is still added to this HashMap
      for (rack <- sched.getRackForHost(loc.host)) {
        addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
      }
    }
    /*
     * If the locs passed when this ShuffleMapTask or ResultTask was created are Nil,
     * add the task index to pendingTasksWithNoPrefs, an ArrayBuffer whose elements are task indexes.
     */
    if (tasks(index).preferredLocations == Nil) {
      addTo(pendingTasksWithNoPrefs)
    }

    if (!readding) {
      /*
       * Every task is added to allPendingTasks.
       */
      allPendingTasks += index  // No point scanning this whole list to find the old task there; all tasks end up in this list, including tasks in the DirectDStream case
    }
  }

 
From the code above we can see that a ShuffleMapTask or ResultTask whose TaskLocation is an ExecutorCacheTaskLocation is added to pendingTasksForExecutor, pendingTasksForHost, and allPendingTasks at the same time, while a ShuffleMapTask or ResultTask whose preferred locations are Nil is added to pendingTasksWithNoPrefs and allPendingTasks.

One thing to note: although the same task is added to several pending queues, it will not be scheduled more than once; the reason is explained below when we discuss task scheduling.

After TaskSchedulerImpl.submitTasks has created the TaskSetManager, it calls CoarseGrainedSchedulerBackend.reviveOffers to request resources for running Stage2.

CoarseGrainedSchedulerBackend.reviveOffers eventually calls TaskSchedulerImpl.resourceOffers to allocate execution resources. The code is as follows:

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
    // Mark each slave as alive and remember its hostname
    // Also track if new executor is added
    var newExecAvail = false
    for (o <- offers) {
      executorIdToHost(o.executorId) = o.host
      activeExecutorIds += o.executorId
      if (!executorsByHost.contains(o.host)) {
        executorsByHost(o.host) = new HashSet[String]()
        executorAdded(o.executorId, o.host)
        newExecAvail = true
      }
      for (rack <- getRackForHost(o.host)) {
        hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
      }
    }

    // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
    val shuffledOffers = Random.shuffle(offers)
    // Build a list of tasks to assign to each worker.
    // tasks is a sequence whose elements are ArrayBuffer[TaskDescription]s; each buffer is sized to the number of cores of the corresponding executor
    val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
    // availableCpus: each element is the number of CPU cores still available on the corresponding executor
    val availableCpus = shuffledOffers.map(o => o.cores).toArray
    val sortedTaskSets = rootPool.getSortedTaskSetQueue  // get the TaskSets from rootPool, ordered by the configured scheduling algorithm
    for (taskSet <- sortedTaskSets) {
      logDebug("parentName: %s, name: %s, runningTasks: %s".format(
        taskSet.parent.name, taskSet.name, taskSet.runningTasks))
      if (newExecAvail) {
        taskSet.executorAdded()  // a new executor was added, so recompute the tasks' locality levels
      }
    }

    // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
    // of locality levels so that it gets a chance to launch local tasks on all of them.
    // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
    var launchedTask = false
    /*
     * For each task set, PROCESS_LOCAL tasks are offered first and ANY tasks last.
     */
    for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }

    if (tasks.size > 0) {
      hasLaunchedTask = true
    }
    return tasks
  }

When allocating resources to the tasks of a task set, the locality priority is: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY.

TaskSchedulerImpl.resourceOffers ultimately calls TaskSchedulerImpl.resourceOfferSingleTaskSet to allocate resources for one task set. resourceOfferSingleTaskSet loops over the executor offers, takes runnable task ids from the pending queues, and assigns them to the executor currently being offered:

 private def resourceOfferSingleTaskSet(
      taskSet: TaskSetManager,
      maxLocality: TaskLocality,
      shuffledOffers: Seq[WorkerOffer],
      availableCpus: Array[Int],
      tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
    var launchedTask = false
    for (i <- 0 until shuffledOffers.size) {
      val execId = shuffledOffers(i).executorId
      val host = shuffledOffers(i).host
      if (availableCpus(i) >= CPUS_PER_TASK) {  // assign tasks according to the number of available CPU cores
        try {
          for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
            tasks(i) += task  // place this task on the i-th worker (the worker order has already been shuffled)
            val tid = task.taskId
            taskIdToTaskSetId(tid) = taskSet.taskSet.id  // record which task set this task belongs to
            taskIdToExecutorId(tid) = execId  // record which executor this task runs on
            executorsByHost(host) += execId
            availableCpus(i) -= CPUS_PER_TASK
            assert(availableCpus(i) >= 0)
            launchedTask = true
          }
        } catch {
          case e: TaskNotSerializableException =>
            logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
            // Do not offer resources for this task, but don't throw an error to allow other
            // task sets to be submitted.
            return launchedTask
        }
      }
    }
    return launchedTask
  }
TaskSchedulerImpl.resourceOfferSingleTaskSet calls TaskSetManager.resourceOffer to allocate resources to a single task in the task set, and TaskSetManager.resourceOffer calls TaskSetManager.dequeueTask to do the actual selection. The code is as follows:

 private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
    : Option[(Int, TaskLocality.Value, Boolean)] =
  {
    /*
     * If the locs passed when the ShuffleMapTask or ResultTask was created are of type
     * ExecutorCacheTaskLocation, the task index is taken from the pendingTasksForExecutor HashMap.
     */
    for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {  // for a KafkaRDD the partition's host differs from every executor's host, so such tasks cannot be taken from this HashMap
      for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
        return Some((index, TaskLocality.NODE_LOCAL, false))
      }
    }
    /*
     * If the locs passed when the ShuffleMapTask or ResultTask was created are Nil,
     * the task index is taken from pendingTasksWithNoPrefs.
     */
    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
      // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
      for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
        return Some((index, TaskLocality.PROCESS_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
      for {
        rack <- sched.getRackForHost(host)
        index <- dequeueTaskFromList(execId, getPendingTasksForRack(rack))
      } {
        return Some((index, TaskLocality.RACK_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {  // tasks processing a KafkaRDD are taken from the allPendingTasks list here
      for (index <- dequeueTaskFromList(execId, allPendingTasks)) {
        return Some((index, TaskLocality.ANY, false))
      }
    }

    // find a speculative task if all others tasks have been scheduled
    dequeueSpeculativeTask(execId, host, maxLocality).map {
      case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
  }

TaskSetManager.dequeueTask calls TaskSetManager.dequeueTaskFromList to take a task from a pending list:

private def dequeueTaskFromList(execId: String, list: ArrayBuffer[Int]): Option[Int] = {
    var indexOffset = list.size
    while (indexOffset > 0) {
      indexOffset -= 1
      val index = list(indexOffset)
      if (!executorIsBlacklisted(execId, index)) {
        // This should almost always be list.trimEnd(1) to remove tail
        list.remove(indexOffset)
        /*
         * If the task has already started running, copiesRunning(index) == 1;
         * if it has already completed successfully, successful(index) == true.
         * This check prevents the same task from being launched twice.
         */
        if (copiesRunning(index) == 0 && !successful(index)) {
          return Some(index)
        }
      }
    }
    None
  }


For a task with very strong locality, such as one with an ExecutorCacheTaskLocation, the task index is added to three data structures: pendingTasksForExecutor, pendingTasksForHost, and allPendingTasks. Once the task has started running, copiesRunning(index) == 1, and once it has completed successfully, successful(index) == true; the check copiesRunning(index) == 0 && !successful(index) therefore prevents the task from being executed more than once.
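A tiny self-contained model (not Spark source; the names simply mirror the fields discussed above) shows why the duplicate entries in several pending lists are harmless:

import scala.collection.mutable.ArrayBuffer

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val copiesRunning = Array.fill(1)(0)
    val successful = Array.fill(1)(false)

    // Task index 0 sits in both an executor-local list and the global list,
    // mimicking pendingTasksForExecutor and allPendingTasks.
    val pendingForExecutor = ArrayBuffer(0)
    val allPending = ArrayBuffer(0)

    // Simplified dequeueTaskFromList: pop from the tail, skip indexes already running or finished.
    def dequeue(list: ArrayBuffer[Int]): Option[Int] = {
      var i = list.size
      while (i > 0) {
        i -= 1
        val index = list.remove(i)
        if (copiesRunning(index) == 0 && !successful(index)) return Some(index)
      }
      None
    }

    val first = dequeue(pendingForExecutor)  // Some(0): the task is launched from the executor-local list
    first.foreach(idx => copiesRunning(idx) += 1)
    val second = dequeue(allPending)         // None: the same index is skipped because it is already running
    println(s"first=$first, second=$second")
  }
}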




From the analysis above we can draw the following conclusion.

Which executor a Stage2 task runs on falls into two cases:

Case 1: when the Stage2 task is created, the locs passed in are of type ExecutorCacheTaskLocation, i.e. (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION. Such a task runs on the executor where Map2 ran.

Case 2: when the Stage2 task is created, the locs passed in are Nil, i.e. no map location's share exceeds the threshold (for example (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) < REDUCER_PREF_LOCS_FRACTION). Such a task is handed to whichever shuffled executor offer dequeues it, which is effectively an arbitrary executor.



