Spark DAG Generation and Task Submission, Launch, and Execution: A Source Code Walkthrough

When running a Spark job, we may wonder what the execution flow actually looks like: how the DAG is generated, where tasks are launched, how the driver and executors communicate, and so on. Below we use a simple Spark word count job to get a rough picture of what happens under the hood.
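
To ground the walkthrough, here is a minimal sketch of the kind of job we will be following, modeled on Spark's JavaWordCount example (the job actually debugged later in this article appears to contain one extra shuffle step, so treat this only as a reference point):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public final class JavaWordCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();

    // Split every input line into words
    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());

    // Map each word to (word, 1) and sum per key; reduceByKey introduces a shuffle
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    // collect is the action that actually triggers the job
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    spark.stop();
  }
}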

Creating the SparkSession object

When developing a Spark job, the first step is to create an instance of SparkSession, the entry point of a Spark application:

SparkSession spark = SparkSession.builder().appName("JavaWordCount").master("local").getOrCreate();

Let's look at what happens while this object is created.

SparkSession.builder() creates a Builder object, on which we set the required parameters such as the app name and master; calling getOrCreate() then returns a SparkSession object. Let's see what getOrCreate() mainly does.

First it creates a SparkConf object and applies the options we just passed in, such as the app name and master.

val sparkConf = new SparkConf()
options.foreach { case (k, v) => sparkConf.set(k, v) }

Creating the SparkContext

Next, the SparkContext object is created:

SparkContext.getOrCreate(sparkConf)
setActiveContext(new SparkContext(config))

While constructing the SparkContext, a fair amount of initialization takes place. Here is a brief tour of the important objects and configuration it sets up:

Object creation
  1. Configuration
    The SparkConf we just created is passed in and cloned, then further configuration is added,
_conf = config.clone()

Some required settings are also validated up front; for example, if spark.master or spark.app.name is missing, an exception is thrown immediately.

if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}
  2. Driver
    Next, the driver's host and port are set
_conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
_conf.setIfMissing(DRIVER_PORT, 0)
  3. Files
    Then the jar paths and file paths are resolved
_jars = Utils.getUserJars(_conf)
_files = _conf.getOption(FILES.key).map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten

  4. Environment
    The Spark execution environment is created (cache, map output tracker, etc.)

// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
  5. Resources
_executorMemory = _conf.getOption(EXECUTOR_MEMORY.key)
  6. Communication
_heartbeatReceiver = env.rpcEnv.setupEndpoint(HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
  7. Schedulers
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
Starting the schedulers, heartbeat, and other components
  1. Heartbeat creation and start

The main purpose of this heartbeater is to collect memory metrics.

   // create and start the heartbeater for collecting memory metrics
    _heartbeater = new Heartbeater(
      () => SparkContext.this.reportHeartBeat(_executorMetricsSource),
      "driver-heartbeater",
      conf.get(EXECUTOR_HEARTBEAT_INTERVAL))
    _heartbeater.start()
  2. Starting the taskScheduler
    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

Other utilities are also initialized and started here, but we will skip them for now.

Constructing the SparkSession object

Once the SparkContext has been created and initialized, the sc object is returned:

     val sparkContext = userSuppliedContext.getOrElse {
          // set a random app name if not given.
          if (!sparkConf.contains("spark.app.name")) {
            sparkConf.setAppName(java.util.UUID.randomUUID().toString)
          }

          // create the SparkContext
          SparkContext.getOrCreate(sparkConf)
          // Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
        }

Then the SparkSession object is created with sc passed in, and the SparkSession is returned.

        session = new SparkSession(sparkContext, None, None, extensions)

Triggering an action

What exactly happens after an action operator is triggered, and how the DAG is generated, is what we want to find out next.

At the end of the word count job, the collect action is called, returning the results to the driver. What goes on inside collect is worth a closer look:

List<Tuple2<String, Integer>> output = counts.collect();

Stepping into the collect operator, we see it delegates to the RDD's collect method:

  def collect(): JList[T] =
    rdd.collect().toSeq.asJava

Stepping into that collect method, we see sc calling runJob:

  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

In org.apache.spark.SparkContext#runJob, several parameters are passed in:

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }
parameter  | functionality
rdd        | the RDD to compute
func       | the function used to compute each partition of the RDD
partitions | the partitions of the RDD to compute
Drilling down a few more levels, we see that it is dagScheduler calling runJob:
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)

One level deeper, the job is submitted:

    val waiter: JobWaiter[U] = submitJob(rdd, func, partitions, callSite, resultHandler, properties)

Inside submitJob, the job is actually submitted:

    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      Utils.cloneProperties(properties)))

After eventProcessLoop posts the job-submission event, the event is received and processed:

  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }

Inside doOnReceive, pattern matching routes the event to the job-submission branch:

    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      log.info("提交作业,处理提交的作业")
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

Stage division happens in org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted: this method creates the ResultStage and, in the process, all the ShuffleMapStages, so by the time it returns, every stage of the job has been created.

    var finalStage: ResultStage = null
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)

Note: stages come in two kinds, ShuffleMapStage and ResultStage. The ResultStage is the stage that carries the action operator, i.e. the last stage of the job; every other stage in the job is a ShuffleMapStage. "ShuffleMapStage" is not named with respect to shuffle write vs. shuffle read: a ShuffleMapStage is a complete stage, whereas the shuffle process itself has a write half and a read half, i.e. before and after the shuffle. As for why it is called ShuffleMapStage: every stage boundary comes from a shuffle, so having "shuffle" in the name is perfectly reasonable. A job is made up of many stages, with a structure like shuffleMapStage1 -> shuffleMapStage2 -> shuffleMapStage3 -> resultStage.
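
To make this concrete for the word count job: every shuffle dependency closes off a ShuffleMapStage, and the stage containing the collect action becomes the ResultStage. Below is a hedged sketch of a pipeline shape that would produce the two ShuffleMapStages plus one ResultStage seen in the debug output further down; the second shuffle (the swap-and-sort step) is only an assumption, since the article does not show the exact operators of the debugged job.

import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

class StageLayoutSketch {
  // Hypothetical shape of the debugged job: two shuffles, hence two ShuffleMapStages plus one ResultStage
  static List<Tuple2<Integer, String>> run(JavaRDD<String> words) {
    // ShuffleMapStage 0: mapToPair up to the reduceByKey shuffle boundary
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    // ShuffleMapStage 1: swap to (count, word); sortByKey introduces the second shuffle
    JavaPairRDD<Integer, String> sorted = counts
        .mapToPair(t -> new Tuple2<>(t._2(), t._1()))
        .sortByKey(false);

    // ResultStage 2: the stage that runs the collect action and returns results to the driver
    return sorted.collect();
  }
}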

Dividing into stages

org.apache.spark.scheduler.DAGScheduler#createResultStage first creates all of the parent ShuffleMapStages and returns them as a list of stages.

    val parents: List[Stage] = getOrCreateParentStages(rdd, jobId)
    
  /**
   * Get or create the list of parent stages for a given RDD.  The new Stages will be created with
   * the provided firstJobId.
   */
    private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
    val shuffleDependencies: mutable.HashSet[ShuffleDependency[_, _, _]] = getShuffleDependencies(rdd)
    val shuffleMapStages: List[ShuffleMapStage] = shuffleDependencies.map(getOrCreateShuffleMapStage(_, firstJobId)).toList
    shuffleMapStages
  }

org.apache.spark.scheduler.DAGScheduler#getOrCreateParentStages does two things:

  1. Collect all shuffle dependencies
  2. Create all the ShuffleMapStages

First, the logic for collecting the shuffle dependencies:

  private[scheduler] def getShuffleDependencies(
                                                 rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
    val parents = new HashSet[ShuffleDependency[_, _, _]]
    val visited: mutable.HashSet[RDD[_]] = new HashSet[RDD[_]]
    val waitingForVisit: ListBuffer[RDD[_]] = new ListBuffer[RDD[_]]
    waitingForVisit += rdd
    while (waitingForVisit.nonEmpty) {
      val toVisit: RDD[_] = waitingForVisit.remove(0)
      if (!visited(toVisit)) {
        visited += toVisit
        val dependencies: Seq[Dependency[_]] = toVisit.dependencies
        dependencies.foreach {
          case shuffleDep: ShuffleDependency[_, _, _] =>
            parents += shuffleDep
          case dependency =>
            waitingForVisit.prepend(dependency.rdd)
        }
      }
    }
    parents
  }

In short: take the dependencies of the RDD passed in and check whether each is a shuffle dependency. If it is, add it to the set; if it is not, take that dependency's RDD, look at its dependencies in turn, and keep going until a shuffle dependency is reached. The method finally returns the set of all shuffle dependencies found this way.

The second step is obtaining all the ShuffleMapStages. The DAG scheduler keeps a map named shuffleIdToMapStage that records the mapping from shuffle dependency id to ShuffleMapStage. The main job of getOrCreateShuffleMapStage is to create a ShuffleMapStage for a given shuffle dependency and register it in shuffleIdToMapStage.

    /**
      * Mapping from shuffle dependency ID to the ShuffleMapStage that will generate the data for
      * that dependency. Only includes stages that are part of currently running job (when the job(s)
      * that require the shuffle stage complete, the mapping will be removed, and the only record of
      * the shuffle data will be in the MapOutputTracker).
      */
    private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]
     
    private def getOrCreateShuffleMapStage(
                                          shuffleDep: ShuffleDependency[_, _, _],
                                          firstJobId: Int): ShuffleMapStage = {
    shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
      case Some(stage) =>
        stage

      case None =>
        // Create stages for all missing ancestor shuffle dependencies.
        val ancesterShuffleDep: ListBuffer[ShuffleDependency[_, _, _]] = getMissingAncestorShuffleDependencies(shuffleDep.rdd)
        ancesterShuffleDep
          .foreach { ancesterDependency =>if (!shuffleIdToMapStage.contains(ancesterDependency.shuffleId)) createShuffleMapStage(ancesterDependency, firstJobId) }
        createShuffleMapStage(shuffleDep, firstJobId)
    }
  }

The ShuffleMapStage itself is created in org.apache.spark.scheduler.DAGScheduler#createShuffleMapStage:

   val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

In the debugger, the stageIdToStage variable indeed contains two ShuffleMapStages and one ResultStage, which matches what the code suggests: every stage of a job except the final ResultStage is a ShuffleMapStage.

stageIdToStage = {HashMap@9362} "HashMap" size = 3
 0 = {Tuple2@12636} "(2,ResultStage 2)"
 1 = {Tuple2@12637} "(1,ShuffleMapStage信息(ShuffleMapStageId=1, rdd=MapPartitionsRDD[8] at mapToPair at JavaWordCount.java:54, numTasks=1, parents=List(ShuffleMapStage信息(ShuffleMapStageId=0, rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, numTasks=1, parents=List(), firstJobId=0, mapOutputTrackerMaster=org.apache.spark.MapOutputTrackerMaster@4ae59b3b, pendingPartitions=Set(), shuffleDep=org.apache.spark.ShuffleDependency@3bb1de8a, mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0))), firstJobId=0, mapOutputTrackerMaster=org.apache.spark.MapOutputTrackerMaster@4ae59b3b, pendingPartitions=Set(), shuffleDep=org.apache.spark.ShuffleDependency@2069dd1, mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0)))"
 2 = {Tuple2@12638} "(0,ShuffleMapStage信息(ShuffleMapStageId=0, rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, numTasks=1, parents=List(), firstJobId=0, mapOutputTrackerMaster=org.apache.spark.MapOutputTrackerMaster@4ae59b3b, pendingPartitions=Set(), shuffleDep=org.apache.spark.ShuffleDependency@3bb1de8a, mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0)))"

Note: rdd.dependencies is the list of dependencies of the current RDD; since an RDD may depend on several RDDs, it returns a sequence. dependency.rdd is the single parent RDD that a given dependency points to. Looking at the code, rdd.dependencies.map(_.rdd) therefore means "get all parent RDDs of this RDD through its dependencies". The dependency information thus records how the RDD came to be, all the way from the very first RDD to the current one. Following the dependencies you can trace backwards: rdd -> dependency -> parent rdd -> dependency -> grandparent rdd ... back to the very beginning.
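
A handy way to inspect this lineage yourself is RDD.toDebugString, which prints the dependency chain of an RDD with each shuffle boundary starting a new indentation level. A small sketch, reusing the counts pair RDD from the word count example near the top:

// Print the full lineage of the counts RDD; each indentation level marks a shuffle (i.e. stage) boundary
System.out.println(counts.toDebugString());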

After a stage is created, it is registered in a stageId -> stage map,

private[scheduler] val stageIdToStage: mutable.HashMap[Int, Stage] = new HashMap[Int, Stage]
stageIdToStage(id) = stage

and also recorded in the jobId -> stageIds map.

private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]]
val stage: Stage = stages.head
stage.jobIds += jobId
jobIdToStageIds.getOrElseUpdate(jobId, new HashSet[Int]()) += stage.id

Back in org.apache.spark.scheduler.DAGScheduler#handleJobSubmitted, once the ResultStage has been created, an ActiveJob is created and added to the jobId -> ActiveJob map as well as to the activeJobs set; finally, the stage is submitted.

val job: ActiveJob = new ActiveJob(jobId, finalStage, callSite, listener, properties)
private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob]
private[scheduler] val activeJobs = new HashSet[ActiveJob]
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
submitStage(finalStage)

Submitting stages

org.apache.spark.scheduler.DAGScheduler#submitStage recursively walks back to the earliest ShuffleMapStage and then calls submitMissingTasks.

  /** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage): Unit = {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      log.debug(s"submitStage($stage (name=${stage.name};" +
        s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id)
        log.debug("missing: " + missing)
        if (missing.isEmpty) {
          log.info("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            submitStage(parent)
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }

The debug output shows that the stage passed to submitMissingTasks is the first ShuffleMapStage.

ShuffleMapStage信息(ShuffleMapStageId=0, rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, numTasks=1, parents=List(), firstJobId=0, mapOutputTrackerMaster=MapOutputTrackerMaster(minSizeForBroadcast=524288, shuffleLocalityEnabled=true, SHUFFLE_PREF_MAP_THRESHOLD=1000, SHUFFLE_PREF_REDUCE_THRESHOLD=1000, REDUCER_PREF_LOCS_FRACTION=0.2, shuffleStatuses=Map(0 -> org.apache.spark.ShuffleStatus@c8d15f4, 1 -> org.apache.spark.ShuffleStatus@4323919e), maxRpcMessageSize=134217728, mapOutputRequests=[], isLocal=true, getNumCachedSerializedBroadcast=0, getEpoch=0), pendingPartitions=Set(), shuffleDep=ShuffleDependency信息(keyClassName=java.lang.Object, valueClassName=java.lang.Object, combinerClassName=Some(java.lang.Object), shuffleId=1, _rdd=MapPartitionsRDD[6] at mapToPair at JavaWordCount.java:50, partitioner=org.apache.spark.HashPartitioner@1, serializer=org.apache.spark.serializer.JavaSerializer@76d5a149, keyOrdering=None, aggregator=Some(Aggregator(org.apache.spark.rdd.PairRDDFunctions$$Lambda$1521/1852790850@73e25780,org.apache.spark.api.java.JavaPairRDD$$$Lambda$1519/493495005@1d535b78,org.apache.spark.api.java.JavaPairRDD$$$Lambda$1519/493495005@1d535b78)), mapSideCombine=true, shuffleWriterProcessor=org.apache.spark.shuffle.ShuffleWriteProcessor@610ef36f), mapStageJobs=List(), numAvailableOutputs=0, isAvailable=false, findMissingPartitions=Vector(0))

The stage is actually submitted in submitMissingTasks; let's look at its main logic:

  1. Figure out the partitions to compute and the job properties
    // Figure out the indexes of partition ids to compute.
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

    // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
    // with this Stage
    val properties: Properties = jobIdToActiveJob(jobId).properties
  2. Add the stage to the set of running stages
 // Stages we are running right now
  private[scheduler] val runningStages = new HashSet[Stage]

runningStages += stage
  3. Get the preferred locations of the partitions, so that tasks can be launched on the nodes holding the data
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
        case s: ResultStage =>
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } 
  4. Serialize the stage and broadcast it
    var taskBinary: Broadcast[Array[Byte]] = null
    var partitions: Array[Partition] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      var taskBinaryBytes: Array[Byte] = null
      // taskBinaryBytes and partitions are both effected by the checkpoint status. We need
      // this synchronization in case another concurrent job is checkpointing this RDD, so we get a
      // consistent view of both variables.
      RDDCheckpointData.synchronized {
        taskBinaryBytes = stage match {
          case stage: ShuffleMapStage =>
            JavaUtils.bufferToArray(
              closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
          case stage: ResultStage =>
            JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
        }

        partitions = stage.rdd.partitions
      }

      if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KIB * 1024) {
        log.warn(s"Broadcasting large task binary with size " +
          s"${Utils.bytesToString(taskBinaryBytes.length)}")
      }
      log.info("广播task")
      taskBinary = sc.broadcast(taskBinaryBytes)
    }
  5. Create tasks of the type matching the stage
 val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
              stage.rdd.isBarrier())
          }
      }
    } 
  6. Submit the tasks (i.e. the stage)
    if (tasks.nonEmpty) {
      log.info(s"提交 ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
        s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
      log.info("taskScheduler提交task,传的是一批task,即一个taskSet")
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
    }

In org.apache.spark.scheduler.TaskSchedulerImpl#submitTasks, a TaskSetManager is created. Its main job is to schedule all the tasks within a single TaskSet, retry tasks that fail, and handle locality-aware scheduling for the TaskSet via delay scheduling.

/**
 * Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of
 * each task, retries tasks if they fail (up to a limited number of times), and
 * handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces
 * to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node,
 * and handleSuccessfulTask/handleFailedTask, which tells it that one of its tasks changed state
 *  (e.g. finished/failed).
 *
 * THREADING: This class is designed to only be called from code with a lock on the
 * TaskScheduler (e.g. its event handlers). It should not be called from other threads.
 *
 * @param sched           the TaskSchedulerImpl associated with the TaskSetManager
 * @param taskSet         the TaskSet to manage scheduling for
 * @param maxTaskFailures if any particular task fails this number of times, the entire
 *                        task set will be aborted
 */
 val manager: TaskSetManager = createTaskSetManager(taskSet, maxTaskFailures)

The scheduler then registers the TaskSetManager; scheduling comes in two modes, FIFO and FAIR,

private var schedulableBuilder: SchedulableBuilder = null
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
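
As an aside, the scheduling mode is chosen by the spark.scheduler.mode configuration (FIFO by default), and the delay scheduling mentioned above is governed by the spark.locality.wait settings. A minimal sketch of overriding both when building the session (the values here are only illustrative):

SparkSession spark = SparkSession.builder()
    .appName("JavaWordCount")
    .master("local")
    // FIFO is the default; FAIR lets concurrently submitted jobs share resources more evenly
    .config("spark.scheduler.mode", "FAIR")
    // How long to wait for a better locality level before falling back (delay scheduling)
    .config("spark.locality.wait", "3s")
    .getOrCreate();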

In FIFO mode, the scheduling pool simply adds the TaskSetManager:

  override def addTaskSetManager(manager: Schedulable, properties: Properties): Unit = {
    rootPool.addSchedulable(manager)
  }

And rootPool is just a Schedulable that manages a collection of Pools and TaskSetManagers.

/**
 * A Schedulable entity that represents collection of Pools or TaskSetManagers
 */
private[spark] class Pool(
    val poolName: String,
    val schedulingMode: SchedulingMode,
    initMinShare: Int,
    initWeight: Int)
  extends Schedulable with Logging {

After the pool has added the TaskSetManager, the last step is to revive the offers:

    backend.reviveOffers()

In org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend#reviveOffers, a ReviveOffers message is sent to the driver endpoint.

  override def reviveOffers(): Unit = {
    log.info("发送消息")
    driverEndpoint.send(ReviveOffers)
  }

driverEndpoint is the driver-side RPC endpoint registered by the scheduler backend (the one executors talk to):

  val driverEndpoint: RpcEndpointRef = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint())

Receiving the message and assigning tasks

Besides send, driverEndpoint also has a receive method, which pattern-matches on the incoming message:

      case ReviveOffers =>
        makeOffers()

Stepping into makeOffers, we can see:

  1. Get the currently active executors;
  2. Build a WorkerOffer (available resource info) for each executor;
  3. Hand the offers to the task scheduler, which assigns tasks to the offered resources;
  4. Launch the tasks.

The code is classic enough to quote in full:

            // Make fake resource offers on all executors
    private def makeOffers(): Unit = {
      // Make sure no executor is killed while some task is launching on it
      val taskDescs = withLock {
        // Filter out executors under killing
        log.info("过滤出还在活跃,且没有在被杀死过程的executor")
        val activeExecutors = executorDataMap.filterKeys(isExecutorActive)
        val workOffers: IndexedSeq[WorkerOffer] = activeExecutors.map {
          case (id, executorData) =>
            val workerOffer: WorkerOffer = new WorkerOffer(
              id,
              executorData.executorHost,
              executorData.freeCores,
              Some(executorData.executorAddress.hostPort),
              executorData.resourcesInfo.map { case (rName, rInfo) => (rName, rInfo.availableAddrs.toBuffer) }
            )
            log.debug("获取资源信息{}",workerOffer)
            workerOffer
        }.toIndexedSeq
        val taskDesc: Seq[Seq[TaskDescription]] = scheduler.resourceOffers(workOffers)
        taskDesc
      }
      if (taskDescs.nonEmpty) {
        log.info("启动tasks")
        launchTasks(taskDescs)
      }
    }

In step three, where the resources are offered to the scheduler,

  val taskDesc: Seq[Seq[TaskDescription]] = scheduler.resourceOffers(workOffers)

the main logic of resourceOffers is:

  1. Record the host -> executor -> running-task mappings;
        hostToExecutors(o.host) += o.executorId
        executorIdToHost(o.executorId) = o.host
        executorIdToRunningTaskIds(o.executorId) = HashSet[Long]()
  2. Record each host's rack information;
  for ((host, Some(rack)) <- hosts.zip(getRacksForHosts(hosts))) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += host
    }
  3. Filter out executors on blacklisted nodes;
    val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
      offers.filter { offer =>
        !blacklistTracker.isNodeBlacklisted(offer.host) &&
          !blacklistTracker.isExecutorBlacklisted(offer.executorId)
      }
    }.getOrElse(offers)
  4. Shuffle the offers so that tasks do not all end up on the same worker;
    val shuffledOffers = shuffleOffers(filteredOffers)

  5. Build the bookkeeping structures for available CPUs and resources;
    // Build a list of tasks to assign to each worker.
    val tasks: IndexedSeq[ArrayBuffer[TaskDescription]] = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
    val availableResources: Array[Map[String, mutable.Buffer[String]]] = shuffledOffers.map(_.resources).toArray
    val availableCpus: Array[Int] = shuffledOffers.map(o => o.cores).toArray
 
  6. Get the sorted TaskSetManagers from the scheduling pool; the sorting algorithm is either FIFO or Fair;
    val sortedTaskSets: ArrayBuffer[TaskSetManager] = rootPool.getSortedTaskSetQueue.filterNot(_.isZombie)
    
      override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
    val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
    val sortedSchedulableQueue =
      schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
    for (schedulable <- sortedSchedulableQueue) {
      sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
    }
    sortedTaskSetQueue
  }
  
    private val taskSetSchedulingAlgorithm: SchedulingAlgorithm = {
    schedulingMode match {
      case SchedulingMode.FAIR =>
        new FairSchedulingAlgorithm()
      case SchedulingMode.FIFO =>
        new FIFOSchedulingAlgorithm()
      case _ =>
        val msg = s"Unsupported scheduling mode: $schedulingMode. Use FAIR or FIFO instead."
        throw new IllegalArgumentException(msg)
    }
  }

  7. Iterate over the TaskSetManagers in scheduling order (skipping the barrier-related checks here) and, for each locality level, keep offering resources until no more tasks launch, recording whether any task was launched;
   var launchedAnyTask: Boolean = false
        // Record all the executor IDs assigned barrier tasks on.
        val addressesWithDescs: ArrayBuffer[(String, TaskDescription)] = ArrayBuffer[(String, TaskDescription)]()
        for (currentMaxLocality <- taskSet.myLocalityLevels) {
          var launchedTaskAtCurrentMaxLocality: Boolean = false
          do {
            launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
              currentMaxLocality, shuffledOffers, availableCpus,
              availableResources, tasks, addressesWithDescs)
            launchedAnyTask |= launchedTaskAtCurrentMaxLocality
          } while (launchedTaskAtCurrentMaxLocality)
        }

The core of this lives in org.apache.spark.scheduler.TaskSchedulerImpl#resourceOfferSingleTaskSet. There are two nested loops: the outer one goes over the offered resources (roughly, the containers available for launching tasks), the inner one over the tasks of the TaskSet; each matched task, together with its resource information, is wrapped into a TaskDescription:

 for (task <- taskSet.resourceOffer(execId, host, maxLocality, availableResources(i))) {
            tasks(i) += task
            val tid = task.taskId
            taskIdToTaskSetManager.put(tid, taskSet)
            taskIdToExecutorId(tid) = execId
            executorIdToRunningTaskIds(execId).add(tid)
            availableCpus(i) -= CPUS_PER_TASK
            assert(availableCpus(i) >= 0)
            task.resources.foreach { case (rName, rInfo) =>
              // Remove the first n elements from availableResources addresses, these removed
              // addresses are the same as that we allocated in taskSet.resourceOffer() since it's
              // synchronized. We don't remove the exact addresses allocated because the current
              // approach produces the identical result with less time complexity.
              //  availableResources: Array[Map[String, Buffer[String]]],
              availableResources(i)(rName).remove(0, rInfo.addresses.size)
            }
            

org.apache.spark.scheduler.TaskSetManager#resourceOffer is where the task and its resources get packaged up:

// resource information for the task
        val extraResources = sched.resourcesReqsPerTask.map { taskReq =>
          val rName = taskReq.resourceName
          val count = taskReq.amount
          val rAddresses = availableResources.getOrElse(rName, Seq.empty)
          assert(rAddresses.size >= count, s"Required $count $rName addresses, but only " +
            s"${rAddresses.size} available.")
          // We'll drop the allocated addresses later inside TaskSchedulerImpl.
          val allocatedAddresses = rAddresses.take(count)
          (rName, new ResourceInformation(rName, allocatedAddresses.toArray))
        }.toMap
        
        
// look up the task to run, then serialize it
          val task = tasks(index)
          val serializedTask: ByteBuffer = try {
            ser.serialize(task)
          }
          
// return the TaskDescription

        new TaskDescription(
          taskId,
          attemptNum,
          execId,
          taskName,
          index,
          task.partitionId,
          addedFiles,
          addedJars,
          task.localProperties,
          extraResources,
          serializedTask)
  8. Return the tasks.

Launching tasks

Back in org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint#makeOffers, the task descriptions are launched at the end:

      if (taskDescs.nonEmpty) {
        log.info("启动tasks")
        launchTasks(taskDescs)
      }

In org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint#launchTasks, every TaskDescription is iterated over. The TaskDescription is first serialized, then the executor that will run the task is looked up, its free cores are reduced by the CPUs needed for this task, the required resources are acquired, and finally the LaunchTask message is sent to the executor's endpoint.

val serializedTask: ByteBuffer = TaskDescription.encode(task)
val executorData: ExecutorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
task.resources.foreach { case (rName, rInfo) =>
  assert(executorData.resourcesInfo.contains(rName))
  executorData.resourcesInfo(rName).acquire(rInfo.addresses)
}
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))

The container's launch command starts the main class org.apache.spark.executor.YarnCoarseGrainedExecutorBackend, which calls org.apache.spark.executor.CoarseGrainedExecutorBackend#run and sets up the executor endpoint. The endpoint's receive method pattern-matches on LaunchTask and runs the task-launching logic:

      val backend: CoarseGrainedExecutorBackend = backendCreateFn(env.rpcEnv, arguments, env, cfg.resourceProfile)
      env.rpcEnv.setupEndpoint("Executor", backend)
      
      
          case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        log.info("executor接收到task先进行反序列化")
        val taskDesc: TaskDescription = TaskDescription.decode(data.value)
        log.info("Got assigned task " + taskDesc.taskId)
        taskResources(taskDesc.taskId) = taskDesc.resources
        log.info("终于启动task了,描述",taskDesc.toString)
        executor.launchTask(this, taskDesc)
      }

The task is launched in executor.launchTask, which first creates a TaskRunner (a Runnable) and puts the taskId together with the runner into a map,

  // Maintains the list of running tasks.
  private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]
  
val tr: TaskRunner = new TaskRunner(context, taskDescription)
runningTasks.put(taskDescription.taskId, tr)

Finally, the thread pool on the worker runs this TaskRunner, and the task is started for real.

  private val threadPool = {
    val threadFactory = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat("Executor task launch worker-%d")
      .setThreadFactory((r: Runnable) => new UninterruptibleThread(r, "unused"))
      .build()
    Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
  }
  
  threadPool.execute(tr)

At this point the task has finally been launched on the executor.

Executing the task

The TaskRunner thread executes its run method, which first deserializes the task and then runs it:

task = ser.deserialize[Task[Any]](
          taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
          
          
val res = task.run(
 taskAttemptId = taskId,
 attemptNumber = taskDescription.attemptNumber,
 metricsSystem = env.metricsSystem,
 resources = taskDescription.resources)

Task is an abstract class with two subclasses, ShuffleMapTask and ResultTask. Inside task.run(), the call dispatches to the runTask method of the corresponding subclass.

    val taskContext: TaskContextImpl = new TaskContextImpl(
      stageId,
      stageAttemptId, // stageAttemptId and stageAttemptNumber are semantically equal
      partitionId,
      taskAttemptId,
      attemptNumber,
      taskMemoryManager,
      localProperties,
      metricsSystem,
      metrics,
      resources)
 runTask(context)

In ShuffleMapTask's runTask, the RDD and the shuffle dependency are deserialized first, and then the data is written out:

    val (rdd,dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition)

In ResultTask's runTask, the RDD and the function to apply are deserialized in the same way, then the computation runs and the result is returned. Inside rdd.iterator, the call reaches computeOrReadCheckpoint, where the actual computation happens and the result is produced.

    val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    val result: U = func(context, rdd.iterator(partition, context))

   final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }
  
    private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    computeOrReadCheckpointCount+=1
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      val preRDD = dependencies.map( dep=> if( dep.rdd != null ) dep.rdd.id else -1).filter(_ != -1).mkString(",")
      log.debug("rdd id为{},依赖rdd为{},第{}次读取并且计算,分区信息为{},{}",this.id.toString,preRDD,computeOrReadCheckpointCount.toString,split.toString,context.toString)
      val iterator: Iterator[T] = compute(split, context)
      val list = iterator.toList
      log.debug("返回{}条计算结果:{}",list.size,list.take(20).zipWithIndex.map(r=>"第"+r._2+"条数据: "+r._1).mkString("; "))
      list.iterator
    }
  }

Once the result has been computed, it is serialized and then sent back to the driver:

      val value = Utils.tryWithSafeFinally {
          log.debug("运行{}",task)
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = taskDescription.attemptNumber,
            metricsSystem = env.metricsSystem,
            resources = taskDescription.resources)
          threwException = false
          res
        }
        
    val valueBytes = resultSer.serialize(value)
    val directResult = new DirectTaskResult(valueBytes, accumUpdates, metricPeaks)
    val serializedDirectResult = ser.serialize(directResult)
    
    execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

In org.apache.spark.executor.CoarseGrainedExecutorBackend#statusUpdate, the executor backend sends the result to the driver endpoint. With that, the task has finished executing and its result has been returned to the driver.

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
    val resources = taskResources.getOrElse(taskId, Map.empty[String, ResourceInformation])
    val msg = StatusUpdate(executorId, taskId, state, data, resources)
    if (TaskState.isFinished(state)) {
      taskResources.remove(taskId)
    }
    driver match {
      case Some(driverRef) => driverRef.send(msg)
      case None => log.warn(s"Drop $msg because has not yet connected to driver")
    }
  }
