I can't resist starting with the official class comment. It is really good: you don't even need a blog post, just read this, it's the most original and accurate source.
The Spark version I'm using is 2.3.
/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
* tasks that can run right away based on the data that's already on the cluster (e.g. map output
* files from previous stages), though it may fail if this data becomes unavailable.
*
* Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
* "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
* in each stage, but operations with shuffle dependencies require multiple stages (one to write a
* set of map output files, and another to read those files after a barrier). In the end, every
* stage will have only shuffle dependencies on other stages, and may compute multiple operations
* inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
* various RDDs
*
* In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*
* When looking through this code, there are several key concepts:
*
* - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
* For example, when the user calls an action, like count(), a job will be submitted through
* submitJob. Each Job may require the execution of multiple stages to build intermediate data.
*
* - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
* task computes the same function on partitions of the same RDD. Stages are separated at shuffle
* boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
* fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
* executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
* Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
*
* - Tasks are individual units of work, each sent to one machine.
*
* - Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them
* and likewise remembers which shuffle map stages have already produced output files to avoid
* redoing the map side of a shuffle.
*
* - Preferred locations: the DAGScheduler also computes where to run each task in a stage based
* on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
*
* - Cleanup: all data structures are cleared when the running jobs that depend on them finish,
* to prevent memory leaks in a long-running application.
*
* To recover from failures, the same stage might need to run multiple times, which are called
* "attempts". If the TaskScheduler reports that a task failed because a map output file from a
* previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a
* CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small
* amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost
* stage(s) that compute the missing tasks. As part of this process, we might also have to create
* Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since
* tasks from the old attempt of a stage could still be running, care must be taken to map any
* events received in the correct Stage object.
*
* Here's a checklist to use when making or reviewing changes to this class:
*
* - All data structures should be cleared when the jobs involving them end to avoid indefinite
* accumulation of state in long-running programs.
*
* - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to
* include the new structure. This will help to catch memory leaks.
*/
private[spark]
class DAGScheduler(
private[scheduler] val sc: SparkContext,
private[scheduler] val taskScheduler: TaskScheduler,
listenerBus: LiveListenerBus,
mapOutputTracker: MapOutputTrackerMaster,
blockManagerMaster: BlockManagerMaster,
env: SparkEnv,
clock: Clock = new SystemClock())
extends Logging {
I'll find time later to dig into the details.
The DAGScheduler is a heavyweight player with a big job. The comment above is a bit long, so let me summarize:
1. It is the high-level, stage-oriented scheduling layer (as opposed to the lower-level TaskScheduler).
2. It is the one that computes the DAG (the famous directed acyclic graph), tracks which RDDs and stage outputs are materialized, and plans how the job runs.
3. It wraps each stage's tasks into a TaskSet and hands it to the TaskScheduler to execute.
4. Preferred locations are also decided by this big guy; that is what determines the locality level each task runs at.
5. If shuffle output files are lost and earlier stages need to be resubmitted, that is also its responsibility; failures inside a stage that are not caused by shuffle file loss, however, are the TaskScheduler's business.
So when does this DAGScheduler get instantiated?
For that we have to start from the entry point of everything, SparkContext: you can see that it is instantiated right at the start, as a field of the context.
How does it set the whole machine in motion?
SparkContext.runJob is the entry point for every action, and what it actually calls is dagScheduler.runJob().
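To make that concrete, here is a tiny runnable example of my own (not from the post's original source excerpts): the count() at the end is an action, so it goes through SparkContext.runJob and therefore through the DAGScheduler that the context created at startup.

import org.apache.spark.{SparkConf, SparkContext}

object DagSchedulerEntryPoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-demo").setMaster("local[2]"))
    // count() is an action: it calls sc.runJob, which delegates to dagScheduler.runJob
    val n = sc.parallelize(1 to 100, 4).map(_ * 2).count()
    println(s"count = $n")
    sc.stop()
  }
}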
Let's keep tracing:
inside the DAGScheduler, runJob() calls dagScheduler.submitJob(), which returns a waiter (a JobWaiter); runJob then blocks on that waiter until the job has finished executing. This is why, under normal circumstances, jobs submitted from a single thread run serially.
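The shape of that pattern, as a minimal self-contained sketch (hypothetical class names of mine, not the real JobWaiter/DAGScheduler code): submission returns immediately with a waiter, and runJob blocks on it.

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Hypothetical stand-in for Spark's JobWaiter: completed by the scheduler,
// awaited by the caller of runJob.
class JobWaiterSketch {
  private val promise = Promise[Unit]()
  def jobFinished(): Unit = promise.trySuccess(())
  def awaitResult(): Unit = Await.result(promise.future, Duration.Inf)
}

object RunJobSketch {
  // submitJob returns right away; a plain thread stands in for the event loop here.
  def submitJob(work: () => Unit): JobWaiterSketch = {
    val waiter = new JobWaiterSketch
    new Thread(new Runnable {
      override def run(): Unit = { work(); waiter.jobFinished() }
    }).start()
    waiter
  }

  // runJob blocks on the waiter, which is why jobs on one caller thread run one after another.
  def runJob(work: () => Unit): Unit = {
    val waiter = submitJob(work)
    waiter.awaitResult()
  }
}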
Here we run into an eventProcessLoop [DAGSchedulerEventProcessLoop], which is what the job is submitted through.
Let's trace it back:
first, let's translate its comment to see what this EventLoop is all about:
an event loop receives events from callers and processes all of them on a dedicated event thread.
In other words, it is a container that keeps receiving events and handling them. Let's take a look:
internally it is implemented with a double-ended blocking queue,
and a new thread is created to do the processing:
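Roughly, the mechanism looks like the following minimal sketch (a simplification I wrote, not the real org.apache.spark.util.EventLoop): a blocking deque holds the events, and a single daemon thread drains it and calls onReceive for each one.

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

abstract class EventLoopSketch[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  // The single event thread: takes events off the deque and dispatches them.
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped.get) {
        try {
          onReceive(eventQueue.take()) // take() blocks until an event arrives
        } catch {
          case _: InterruptedException => // woken up by stop()
        }
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  // Subclasses (DAGSchedulerEventProcessLoop plays this role in Spark) implement the dispatch.
  protected def onReceive(event: E): Unit
}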
The key method of the abstract class EventLoop is onReceive; let's see how [DAGSchedulerEventProcessLoop] implements it.
It delegates to DAGSchedulerEventProcessLoop.doOnReceive, which pattern matches on the event.
Since we are submitting a job, the case that fires is of course JobSubmitted.
Keep tracing:
it calls dagScheduler.createResultStage(), which returns a ResultStage.
Why is a ResultStage created first?
Because execution is triggered from the last stage. Spark is lazy, so when it is time to compute, the scheduler traces backwards from the final stage towards the beginning, which is exactly the opposite direction to the one in which the DAG was formed.
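A small runnable example of my own to make that concrete: the collect() below causes the ResultStage to be created first; walking the lineage backwards then hits the reduceByKey shuffle, which becomes a ShuffleMapStage parent, while the narrow flatMap/map steps are pipelined into that stage (exactly as the class comment above describes).

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("a b", "b c", "a c"), 2)
      .flatMap(_.split(" "))   // narrow dependency: pipelined into the ShuffleMapStage
      .map(word => (word, 1))  // narrow dependency: same stage
      .reduceByKey(_ + _)      // shuffle dependency: stage boundary
      .collect()               // action: the ResultStage is built first, then its parents
    counts.foreach(println)
    sc.stop()
  }
}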
At the end of the handleJobSubmitted method:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
As the doc comment says:
submit the stage, but first recursively submit any of its parent stages that are still missing. Stages depend on each other layer by layer; a stage can only run once the stages it depends on have finished, so we have to recurse back through the parents first.
Let's step in and have a look:
getMissingParentStages(stage: Stage): List[Stage] returns the parent stages further up the dependency chain. I've always felt that translating "missing" as "lost" doesn't fit here; I read it as the parent stages that haven't been computed yet, i.e. whose output is not yet available.
private def getMissingParentStages(stage: Stage): List[Stage] = {
val missing = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new ArrayStack[RDD[_]]
//I've moved this part of the code up here so the flow reads more clearly
//push the stage's own rdd onto the stack to start the traversal
waitingForVisit.push(stage.rdd)
while (waitingForVisit.nonEmpty) {
//the matching pop: as long as waitingForVisit is non-empty, keep calling visit()
visit(waitingForVisit.pop())
}
//and here is def visit() itself
// note that what we push and visit are RDDs
def visit(rdd: RDD[_]) {
// only process this rdd if it has not been visited yet; first mark it as visited
if (!visited(rdd)) {
visited += rdd
val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
if (rddHasUncachedPartitions) {
//iterate over all of this rdd's dependencies
for (dep <- rdd.dependencies) {
//check whether each dependency is a wide (shuffle) or a narrow one
dep match {
// if a shuffle is involved, get or create the intermediate player: a ShuffleMapStage
case shufDep: ShuffleDependency[_, _, _] =>
val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
// if the mapStage's output is not yet available, add it to the missing set
if (!mapStage.isAvailable) {
missing += mapStage
}
//for a narrow dependency, just push its parent rdd onto the stack
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
missing.toList
}
Three collections are worth pointing out:
//holds the parent stages whose output is not yet available
val missing = new HashSet[Stage]
//holds the RDDs that have already been visited
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new ArrayStack[RDD[_]]
This stack is maintained manually to avoid the StackOverflowError that deep recursion could otherwise cause.
Now back to:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
//if all the parent stages are in place
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
//start submitting this stage's not-yet-submitted tasks
submitMissingTasks(stage, jobId.get)
} else {
// otherwise keep recursing into submitStage itself until all parent stages have been submitted
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
Next we need to look at:
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
stage match {
case s: ShuffleMapStage =>
outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
case s: ResultStage =>
outputCommitCoordinator.stageStart(
stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
} catch {
case NonFatal(e) =>
stage.makeNewStageAttempt(partitionsToCompute.size)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
// here serialization enters the picture
var taskBinary: Broadcast[Array[Byte]] = null
var partitions: Array[Partition] = null
try {
var taskBinaryBytes: Array[Byte] = null
RDDCheckpointData.synchronized {
//the rdd together with the stage's dependency (or result function) is serialized here; serialization is what makes them shippable across the cluster (see the sketch after this method)
taskBinaryBytes = stage match {
// for a ShuffleMapStage, what gets serialized is (stage.rdd, stage.shuffleDep): a map stage is only an intermediate step, so it has to carry its shuffle dependency along
case stage: ShuffleMapStage =>
JavaUtils.bufferToArray(
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
//for a ResultStage, what gets shipped is (stage.rdd, stage.func)
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
partitions = stage.rdd.partitions
}
// the taskBinaryBytes assigned above are then broadcast
taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
// In the case of a failure during serialization, abort the stage.
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString, Some(e))
runningStages -= stage
// Abort execution
return
case NonFatal(e) =>
abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
//a new ShuffleMapTask: we're getting closer to the goal; it carries the broadcast variable taskBinary we built above
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
//same idea as above
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
//at this point we have a Seq[Task[_]]
if (tasks.size > 0) {
//time to submit the tasks, exciting! taskScheduler makes its entrance with a new TaskSet
// taskScheduler is a trait, so we have to look at its implementation; there is only one, TaskSchedulerImpl, and that is the one doing the real work
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
//this submits the waiting child stages; as we said before, if stages depend on each other, a child can only run after its parent has finished
submitWaitingChildStages(stage)
}
}
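The design choice behind taskBinary, sketched here without any Spark types (my own simplification, using plain Java serialization instead of Spark's closure serializer): the per-stage payload is serialized once and broadcast, and every task of the stage deserializes the same bytes rather than each task shipping its own copy.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TaskBinarySketch {
  // Serialize the stage payload once: (rdd, shuffleDep) for a ShuffleMapStage,
  // (rdd, func) for a ResultStage. In Spark the result is then wrapped in a broadcast variable.
  def serializeOnce(payload: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(payload)
    out.close()
    bytes.toByteArray
  }

  // Every task of the stage deserializes the same broadcast bytes on the executor side.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}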
There are two more points we need to look at.
TaskSchedulerImpl.submitTasks(taskSet: TaskSet): let's follow the source:
override def submitTasks(taskSet: TaskSet) {
val tasks = taskSet.tasks
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
this.synchronized {
// a TaskSetManager is created; note the maxTaskFailures parameter, the maximum number of retries a task is allowed. As mentioned earlier, retrying failures inside a stage that are not caused by shuffle file loss is the TaskScheduler's job: it works at task granularity, not stage granularity
//the taskSet is wrapped into a TaskSetManager and will later be placed into the scheduling pool
val manager = createTaskSetManager(taskSet, maxTaskFailures)
val stage = taskSet.stageId
val stageTaskSets =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets(taskSet.stageAttemptId) = manager
val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
ts.taskSet != taskSet && !ts.isZombie
}
if (conflictingTaskSet) {
throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
}
//another player appears here: schedulableBuilder. As the name hints, it is scheduling-related and it manages TaskSetManagers; we won't dwell on it here (see the note on FIFO vs FAIR after this method) and keep moving
//here the TaskSetManager is put into the scheduling pool to wait to be scheduled
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
if (!isLocal && !hasReceivedTask) {
starvationTimer.scheduleAtFixedRate(new TimerTask() {
override def run() {
if (!hasLaunchedTask) {
logWarning("Initial job has not accepted any resources; " +
"check your cluster UI to ensure that workers are registered " +
"and have sufficient resources")
} else {
this.cancel()
}
}
}, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
}
hasReceivedTask = true
}
//and here comes the heavyweight: backend [SchedulerBackend]. Taken literally it is a "backend", and backends are usually about communication. We will analyze it below.
backend.reviveOffers()
}
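A quick aside on the schedulableBuilder mentioned in the comments above: which pool it builds depends on spark.scheduler.mode, FIFO by default or FAIR if configured. A minimal way to switch it (the XML path below is just a placeholder; point it at a real pools file):

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-scheduling-demo")
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR") // FIFO (default) or FAIR
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // pool definitions, only read in FAIR mode
    val sc = new SparkContext(conf)
    // jobs can then be tagged with a pool before an action is triggered
    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.stop()
  }
}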
SchedulerBackend is a trait; let's look at its implementation (CoarseGrainedSchedulerBackend):
override def reviveOffers() {
//you can see it calls the send method on driverEndpoint, which is this: var driverEndpoint: RpcEndpointRef
//it is a reference to an RPC endpoint; following send() any further gets into Netty territory, which we won't go into
driverEndpoint.send(ReviveOffers)
}
// Internal messages in driver
// so a message is sent out; who is it sent to, i.e. who receives it?
case object ReviveOffers extends CoarseGrainedClusterMessage
Where there is a send there must be a receive, so another heavyweight enters the scene:
CoarseGrainedExecutorBackend
With that, both core backends are on stage. Let's look at CoarseGrainedExecutorBackend's receive method first:
override def receive: PartialFunction[Any, Unit] = {
// pattern matching on the various messages
case RegisteredExecutor =>
logInfo("Successfully registered with driver")
try {
executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
} catch {
case NonFatal(e) =>
exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
}
case RegisterExecutorFailed(message) =>
exitExecutor(1, "Slave registration failed: " + message)
// and here is the message that launches a task
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
executor.launchTask(this, taskDesc)
}
case KillTask(taskId, _, interruptThread, reason) =>
if (executor == null) {
exitExecutor(1, "Received KillTask command but executor was null")
} else {
executor.killTask(taskId, interruptThread, reason)
}
case StopExecutor =>
stopping.set(true)
logInfo("Driver commanded a shutdown")
// Cannot shutdown here because an ack may need to be sent back to the caller. So send
// a message to self to actually do the shutdown.
self.send(Shutdown)
case Shutdown =>
stopping.set(true)
new Thread("CoarseGrainedExecutorBackend-stop-executor") {
override def run(): Unit = {
// executor.stop() will call `SparkEnv.stop()` which waits until RpcEnv stops totally.
// However, if `executor.stop()` runs in some thread of RpcEnv, RpcEnv won't be able to
// stop until `executor.stop()` returns, which becomes a dead-lock (See SPARK-14180).
// Therefore, we put this line in a new thread.
executor.stop()
}
}.start()
case UpdateDelegationTokens(tokenBytes) =>
logInfo(s"Received tokens of ${tokenBytes.length} bytes")
SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
}
By this point you may have a question: we sent ReviveOffers, so why don't we see it anywhere in this receive method?
Because the ReviveOffers message is sent to itself: it is handled by the driver's own endpoint, CoarseGrainedSchedulerBackend.DriverEndpoint (class DriverEndpoint).
Let's check whether it has some kind of start method. It does, onStart(); let's step into it:
override def onStart() {
// Periodically revive offers to allow delay scheduling to work
val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
reviveThread.scheduleAtFixedRate(new Runnable {
override def run(): Unit = Utils.tryLogNonFatalError {
Option(self).foreach(_.send(ReviveOffers))
}
}, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}
//and its receive method
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
// Ignoring the update since we don't know about the executor.
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
// matched! this is the right one; as you can see, it is a message the driver sent to itself
case ReviveOffers =>
//it calls makeOffers(), which is the resource-allocation step: put plainly, it hands out executors, filtering out the ones that have been killed or are otherwise unusable
makeOffers()
case KillTask(taskId, executorId, interruptThread, reason) =>
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.executorEndpoint.send(
KillTask(taskId, executorId, interruptThread, reason))
case None =>
// Ignoring the task kill since the executor is not registered.
logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
}
case KillExecutorsOnHost(host) =>
scheduler.getExecutorsAliveOnHost(host).foreach { exec =>
killExecutors(exec.toSeq, replace = true, force = true)
}
case UpdateDelegationTokens(newDelegationTokens) =>
executorDataMap.values.foreach { ed =>
ed.executorEndpoint.send(UpdateDelegationTokens(newDelegationTokens))
}
case RemoveExecutor(executorId, reason) =>
// We will remove the executor's state and cannot restore it. However, the connection
// between the driver and the executor may be still alive so that the executor won't exit
// automatically, so try to tell the executor to stop itself. See SPARK-13519.
executorDataMap.get(executorId).foreach(_.executorEndpoint.send(StopExecutor))
removeExecutor(executorId, reason)
}
Let's look at makeOffers():
private def makeOffers() {
// Make sure no executor is killed while some task is launching on it
val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
// Filter out executors under killing
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map {
case (id, executorData) =>
// a WorkerOffer represents the resources available on one executor (see the sketch after this method)
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq
scheduler.resourceOffers(workOffers)
}
//key point here: if resourceOffers returned any task descriptions, launchTasks is called to run them
if (!taskDescs.isEmpty) {
launchTasks(taskDescs)
}
}
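What such an offer carries, as a hedged sketch (the real class is org.apache.spark.scheduler.WorkerOffer; the names below are mine, not the actual fields): one offer per live executor, saying where it runs and how many cores are free. resourceOffers matches pending tasks against these offers, taking locality preferences into account, and returns the TaskDescriptions to launch.

object WorkerOfferSketchDemo {
  // Hypothetical stand-in for WorkerOffer, just to show the shape of a resource offer.
  case class WorkerOfferSketch(executorId: String, host: String, freeCores: Int)

  def main(args: Array[String]): Unit = {
    // makeOffers() builds one offer per alive executor and hands the whole batch to
    // TaskSchedulerImpl.resourceOffers, which decides which pending tasks to run where.
    val offers = IndexedSeq(
      WorkerOfferSketch("exec-1", "node-a", freeCores = 4),
      WorkerOfferSketch("exec-2", "node-b", freeCores = 2)
    )
    offers.foreach(println)
  }
}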
Now the launchTasks method:
// Launch tasks returned by a set of resource offers
// "tasks" here is the sequence of TaskDescription sequences returned by the resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
for (task <- tasks.flatten) {
// the task is encoded, which is really just serialization
val serializedTask = TaskDescription.encode(task)
if (serializedTask.limit() >= maxRpcMessageSize) {
//if you've read this far, congratulations: you get to understand a common production error. That makes it worth looking at what a TaskDescription actually is
scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
try {
var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
"spark.rpc.message.maxSize (%d bytes). Consider increasing " +
"spark.rpc.message.maxSize or using broadcast variables for large values."
msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
taskSetMgr.abort(msg)
} catch {
case e: Exception => logError("Exception in error callback", e)
}
}
}
else {
val executorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
s"${executorData.executorHost}.")
//executorData carries the executor's resource information; the send here is the key step: it sends the crucial LaunchTask message carrying the already-serialized task, and the real work is about to start
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
}
}
I wonder whether you have ever hit the production error about spark.rpc.message.maxSize being exceeded. Let's go in and see what it is about.
//here it is; you can see that its addedFiles and addedJars attributes are both Maps
private[spark] class TaskDescription(
val taskId: Long,
val attemptNumber: Int,
val executorId: String,
val name: String,
val index: Int, // Index within this task's TaskSet
val addedFiles: Map[String, Long],
val addedJars: Map[String, Long],
val properties: Properties,
val serializedTask: ByteBuffer) {
A look at the encode method:
def encode(taskDescription: TaskDescription): ByteBuffer = {
val bytesOut = new ByteBufferOutputStream(4096)
val dataOut = new DataOutputStream(bytesOut)
dataOut.writeLong(taskDescription.taskId)
dataOut.writeInt(taskDescription.attemptNumber)
dataOut.writeUTF(taskDescription.executorId)
dataOut.writeUTF(taskDescription.name)
dataOut.writeInt(taskDescription.index)
// Write files.
serializeStringLongMap(taskDescription.addedFiles, dataOut)
// Write jars.
serializeStringLongMap(taskDescription.addedJars, dataOut)
// Write properties.
dataOut.writeInt(taskDescription.properties.size())
taskDescription.properties.asScala.foreach { case (key, value) =>
dataOut.writeUTF(key)
// SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
val bytes = value.getBytes(StandardCharsets.UTF_8)
dataOut.writeInt(bytes.length)
dataOut.write(bytes)
}
// Write the task. The task is already serialized, so write it directly to the byte buffer.
//the already-serialized task is written into the buffer; this is what will be shipped around the cluster
Utils.writeByteBuffer(taskDescription.serializedTask, bytesOut)
dataOut.close()
bytesOut.close()
bytesOut.toByteBuffer
}
So we can see all the writes: files, jars, properties, and the serialized task itself all get packed in. Let's pause here for a moment.
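Tying this back to the error mentioned above: if the encoded TaskDescription ends up larger than spark.rpc.message.maxSize (in MB, default 128), the TaskSetManager is aborted with the message we saw in launchTasks. Two common ways out, sketched in a small example of my own (the lookup table is just an invented large value):

import org.apache.spark.{SparkConf, SparkContext}

object BigTaskWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("big-task-demo")
      .setMaster("local[2]")
      .set("spark.rpc.message.maxSize", "256") // raise the RPC limit (MB); the blunt fix
    val sc = new SparkContext(conf)

    // The better fix, as the abort message itself suggests: broadcast large values once
    // instead of letting every task carry them along.
    val bigLookup = sc.broadcast((1 to 100000).map(i => i -> i.toString).toMap)
    val hits = sc.parallelize(1 to 1000, 4).filter(i => bigLookup.value.contains(i)).count()
    println(s"hits = $hits")
    sc.stop()
  }
}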
The LaunchTask message gets sent, but to whom? Remember from above?
CoarseGrainedExecutorBackend.receive
CoarseGrainedExecutorBackend has an onStart method in which it establishes the connection with the driver. Now look again at its receive method:
in the LaunchTask case it performs a decode, i.e. deserialization, mirroring the encode above.
executor.launchTask(this, taskDesc) is where the task actually starts to run.
Another heavyweight, the Executor, enters the scene.
// here is Executor's launchTask method
// TaskRunner extends Runnable: the task is wrapped in a TaskRunner
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
val tr = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)
threadPool.execute(tr)
}
//runningTasks keeps track of all the tasks that are currently running
private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]
//threadPool: the thread pool used to execute the tasks
threadPool
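Pulling those pieces together, here is a minimal self-contained sketch of the pattern (hypothetical names, not the real Executor internals): wrap the work in a Runnable, remember it in a map of running tasks, and hand it to a cached thread pool so many tasks can run concurrently on one executor.

import java.util.concurrent.{ConcurrentHashMap, Executors}

object ExecutorSketch {
  private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
  private val threadPool = Executors.newCachedThreadPool() // Spark's Executor uses a daemon cached pool

  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val runner = new Runnable {
      override def run(): Unit =
        try body() finally runningTasks.remove(taskId) // drop the entry once the task finishes
    }
    runningTasks.put(taskId, runner)
    threadPool.execute(runner)
  }

  def main(args: Array[String]): Unit = {
    (1L to 4L).foreach { id =>
      launchTask(id, () => println(s"task $id running on ${Thread.currentThread().getName}"))
    }
    Thread.sleep(500)
    threadPool.shutdown()
  }
}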