Analyzing the Spark Application Execution Flow Using WordCount as an Example

WordCount

WordCount is the most basic Spark program; its main function is to count how many times each word appears in a file. The code is concise, as shown below.

package swjtu.cn.mi

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. env: prepare sc (SparkContext, the Spark execution context)
    val conf: SparkConf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // 2. source: read the data
    // RDD: A Resilient Distributed Dataset; think of it as a distributed collection
    // that is as easy to use as an ordinary collection.
    // RDD[String]: one line of text per element
    val lines: RDD[String] = sc.textFile("data/input/words.txt")

    // 3. transformation: transform the data
    // Split: RDD of individual words
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // Mark each word with a 1: RDD of (word, 1) pairs
    val wordAndOnes: RDD[(String, Int)] = words.map((_, 1))
    // Group + aggregate: groupBy + mapValues(_.map(_._2).reduce(_+_)) ===> in Spark this is one step: reduceByKey
    val result: RDD[(String, Int)] = wordAndOnes.reduceByKey(_ + _)

    // 4. sink: output
    // Print directly
    result.foreach(println)
    // Or collect to a local collection and print
    // println(result.collect().toBuffer)
    // Or save to a given path (file/directory)
    // result.repartition(1).saveAsTextFile("data/output/result")
    // To make the Web UI easier to inspect, you can let the program sleep for a while
    // Thread.sleep(1000 * 6000)

    // 5. release resources
    sc.stop()
  }
}

Theoretical Analysis

The RDD chain in this program is: textFile -> flatMap -> map -> reduceByKey -> foreach.

Spark has two kinds of operations: actions and transformations. An action triggers the submission of a job, while a transformation does not submit anything; it only converts one RDD into another. The dependency between the RDDs at the two ends of a transformation also differs by operation. Spark has two kinds of RDD dependencies, narrow dependency and wide dependency, illustrated in the figure below.

[Figure: RDD dependencies, narrow dependency (left) vs. wide dependency (right)]

The left figure shows narrow dependencies and the right one shows wide dependencies. With a narrow dependency, the partition-to-partition mapping stays fixed; a wide dependency involves a shuffle, which reshuffles partitions, so stages are generally split at wide dependencies. In the program above, foreach is an action and reduceByKey introduces a wide dependency, so this application has one job and two stages, and tasks are executed within each stage.
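You can verify this split without reading the source: calling toDebugString on the final RDD prints the lineage, and the indentation change marks the shuffle boundary. A minimal sketch, meant to be run inside the WordCount main after result is defined; the exact output format depends on the Spark version:

// Print the lineage of the WordCount result RDD.
// The ShuffledRDD introduced by reduceByKey marks the boundary between the two stages.
println(result.toDebugString)
// Typical shape of the output:
// (N) ShuffledRDD[4] at reduceByKey ...          <- ResultStage side
//  +-(N) MapPartitionsRDD[3] at map ...          <- ShuffleMapStage side
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  data/input/words.txt MapPartitionsRDD[1] at textFile ...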

Source Code Analysis

Let's start the analysis from the RDD chain.

 def textFile(
       path: String,
       minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
 assertNotStopped()
 hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString).setName(path)
}

The textFile operator returns an RDD, and the RDD chain starts from there. It calls some other functions along the way, such as hadoopFile, but we can ignore them because none of them submits a job. We therefore skip the intermediate steps and go straight to the action, result.foreach(println).
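A quick way to see this laziness in practice (a small illustrative snippet I am adding here, reusing the sc from the WordCount program): transformations only build the lineage, and no job appears in the Web UI until an action runs.

// Transformations are lazy: nothing is computed and no job is submitted here.
val nums = sc.parallelize(1 to 100, numSlices = 4)
val doubled = nums.map(_ * 2)              // still no job
val filtered = doubled.filter(_ % 3 == 0)  // still no job

// Only an action such as count/collect/foreach calls sc.runJob and submits a job.
val n = filtered.count()                   // a job shows up in the Web UI at this point
println(s"count = $n")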

Submitting the Job


def foreach(f: T => Unit): Unit = withScope {
 val cleanF = sc.clean(f)
 sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

Drilling down through runJob, you eventually reach the following code:


def runJob[T, U: ClassTag](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int],
  resultHandler: (Int, U) => Unit): Unit = {
 if (stopped.get()) {
  throw new IllegalStateException("SparkContext has been shutdown")
 }
 val callSite = getCallSite
 val cleanedFunc = clean(func)
 logInfo("Starting job: " + callSite.shortForm)
 if (conf.getBoolean("spark.logLineage", false)) {
  logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
 }
 dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
 progressBar.foreach(_.finishAll())
 rdd.doCheckpoint()
}

Inside DAGScheduler a series of method calls takes place. First, runJob calls submitJob to continue submitting the job, then blocks until the job finishes or fails. Concretely, submitJob creates a JobWaiter object and, via the internal message loop, posts the job to DAGScheduler's inner class DAGSchedulerEventProcessLoop (which extends EventLoop). Finally, in DAGSchedulerEventProcessLoop's message-handling method onReceive, the JobSubmitted case class is matched and DAGScheduler's handleJobSubmitted method is called to actually submit the job; that is where stages are divided.


def submitJob[T, U](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int],
  callSite: CallSite,
  resultHandler: (Int, U) => Unit,
  properties: Properties): JobWaiter[U] = {
 // Check to make sure we are not launching a task on a partition that does not exist.
 val maxPartitions = rdd.partitions.length
 partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
  throw new IllegalArgumentException(
   "Attempting to access a non-existent partition: " + p + ". " +
    "Total number of partitions: " + maxPartitions)
 }
 val jobId = nextJobId.getAndIncrement()
 // If the job contains zero tasks, create and immediately return a JobWaiter for zero tasks
 if (partitions.isEmpty) {
  val clonedProperties = Utils.cloneProperties(properties)
  if (sc.getLocalProperty(SparkContext.SPARK_JOB_DESCRIPTION) == null) {
   clonedProperties.setProperty(SparkContext.SPARK_JOB_DESCRIPTION, callSite.shortForm)
  }
  val time = clock.getTimeMillis()
  listenerBus.post(
   SparkListenerJobStart(jobId, time, Seq.empty, clonedProperties))
  listenerBus.post(
   SparkListenerJobEnd(jobId, time, JobSucceeded))
  // Return immediately if the job is running 0 tasks
  return new JobWaiter[U](this, jobId, 0, resultHandler)
 }
 assert(partitions.nonEmpty)
 // Create the JobWaiter that waits for the job to finish, and submit the job via the internal event loop
 val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
 val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
 eventProcessLoop.post(JobSubmitted(
  jobId, rdd, func2, partitions.toArray, callSite, waiter,
  Utils.cloneProperties(properties)))
 waiter
}
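Note that submitJob itself only posts a JobSubmitted event and returns the JobWaiter; the blocking described above happens back in DAGScheduler.runJob, which waits on that waiter. Below is a minimal, self-contained sketch of this post-an-event-then-wait pattern, using toy classes of my own (ToyScheduler, ToyJobSubmitted) rather than Spark's actual JobWaiter and EventLoop:

import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// A toy event loop: events are posted to a queue and handled by a background thread,
// mirroring how DAGSchedulerEventProcessLoop receives JobSubmitted.
case class ToyJobSubmitted(jobId: Int, done: Promise[Unit])

object ToyScheduler {
  private val queue = new LinkedBlockingQueue[ToyJobSubmitted]()
  private val loop = new Thread(() => while (true) {
    val event = queue.take()            // "onReceive": handle the next event
    println(s"handling job ${event.jobId}")
    event.done.success(())              // mark the job as finished
  })
  loop.setDaemon(true)
  loop.start()

  // Like DAGScheduler.runJob: post the event, then block until the waiter completes.
  def runJob(jobId: Int): Unit = {
    val waiter = Promise[Unit]()        // plays the role of JobWaiter
    queue.put(ToyJobSubmitted(jobId, waiter))
    Await.result(waiter.future, Duration.Inf)
    println(s"job $jobId finished")
  }
}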

Dividing Stages

After the job is submitted, it is divided into stages. In this example there are two stages: stage 0 covers reading the file, flatMap, and map, and the shuffle introduced by reduceByKey marks the boundary; the remaining computation, up to the foreach action, forms the second stage.

Stage scheduling in Spark is implemented by DAGScheduler. Starting from the last RDD, DAGScheduler traverses the whole dependency tree and divides it into stages. The dividing criterion is whether a ShuffleDependency (wide dependency) occurs: whenever an RDD's dependency involves a shuffle, the computation before and after that shuffle is split into two stages.

In the code, this starts in DAGScheduler's handleJobSubmitted method with the ResultStage created from the last RDD. getOrCreateParentStages then determines whether any ancestor RDD involves a shuffle. If there is no shuffle, the job has only one ResultStage with no parent stage; if there is a shuffle, the job has one ResultStage and at least one ShuffleMapStage. The handleJobSubmitted code is as follows:

private[scheduler] def handleJobSubmitted(jobId: Int,
                     finalRDD: RDD[_],
                     func: (TaskContext, Iterator[_]) => _,
                     partitions: Array[Int],
                     callSite: CallSite,
                     listener: JobListener,
                     properties: Properties) {
 var finalStage: ResultStage = null
 try {
  // New stage creation may throw an exception if, for example, jobs are run on a
  // HadoopRDD whose underlying HDFS files have been deleted.
   // Trace back from the last RDD to obtain the final stage (finalStage)
  finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
 } catch {......}
 // Create the ActiveJob from the final stage (finalStage)
 val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
 clearCacheLocs()
 logInfo("Got job %s (%s) with %d output partitions".format(
  job.jobId, callSite.shortForm, partitions.length))
 logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
 logInfo("Parents of final stage: " + finalStage.parents)
 logInfo("Missing parents: " + getMissingParentStages(finalStage))
 val jobSubmissionTime = clock.getTimeMillis()
 jobIdToActiveJob(jobId) = job
 activeJobs += job
 finalStage.setActiveJob(job)
 val stageIds = jobIdToStageIds(jobId).toArray
 val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
 listenerBus.post(
  SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
 // Submit the final stage for execution
 submitStage(finalStage)
 submitWaitingStages()
}
/**
 * Create a ResultStage associated with the provided jobId.
 */
private def createResultStage(
  rdd: RDD[_],
  func: (TaskContext, Iterator[_]) => _,
  partitions: Array[Int],
  jobId: Int,
  callSite: CallSite): ResultStage = {
 val (shuffleDeps, resourceProfiles) = getShuffleDependenciesAndResourceProfiles(rdd)
 val resourceProfile = mergeResourceProfilesForStage(resourceProfiles)
  checkBarrierStageWithDynamicAllocation(rdd)
  checkBarrierStageWithNumSlots(rdd, resourceProfile)
  checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
  // First create all the parent stages; parent stages are always ShuffleMapStages
  val parents = getOrCreateParentStages(shuffleDeps, jobId)
  ...
  // Once all parent stages are created, create the final stage (ResultStage)
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId,
 callSite, resourceProfile.id)
  ...
 stage
}

Stepping into getShuffleDependenciesAndResourceProfiles(rdd), which collects the shuffle dependencies of the finalStage:

private[scheduler] def getShuffleDependenciesAndResourceProfiles(
  rdd: RDD[_]): (HashSet[ShuffleDependency[_, _, _]], HashSet[ResourceProfile]) = {
 // Shuffle dependencies between the current RDD and its parent RDDs
 val parents = new HashSet[ShuffleDependency[_, _, _]]
 val resourceProfiles = new HashSet[ResourceProfile]
 // Tracks which RDDs have already been visited
 val visited = new HashSet[RDD[_]]
 // RDDs waiting to be visited; every RDD is added here before it is traversed
 val waitingForVisit = new ListBuffer[RDD[_]]
 // Start the traversal from the given RDD
 waitingForVisit += rdd
 // While the list is non-empty, the current stage still has RDDs left to traverse.
 // As analyzed earlier, the first RDD visited is the finalRDD, i.e. the RDD on which the action was called.
 while (waitingForVisit.nonEmpty) {
  // Take the next RDD from the list
  val toVisit = waitingForVisit.remove(0)
  if (!visited(toVisit)) {
   // Mark it as visited
   visited += toVisit
   Option(toVisit.getResourceProfile).foreach(resourceProfiles += _)
   toVisit.dependencies.foreach {
    // If the dependency on the parent RDD is a ShuffleDependency (wide dependency), add it to
    // parents and do not traverse past it: the stage is cut at the shuffle boundary
    case shuffleDep: ShuffleDependency[_, _, _] =>
     parents += shuffleDep
    // If it is a narrow dependency, fetch the parent RDD via dependency.rdd and add it to
    // waitingForVisit, so the traversal keeps going within the same stage
    case dependency =>
     waitingForVisit.prepend(dependency.rdd)
   }
  }
 }
 (parents, resourceProfiles)
}

Once all stages have been divided, the dependency relationships between them are established.

Submitting the Stages

In DAGScheduler's handleJobSubmitted method, the finalStage is generated and the dependencies among all stages are established at the same time. An ActiveJob instance is then created from the finalStage, and the stages of that job are submitted for execution in order.

When stage submission begins, submitStage calls getMissingParentStages to find the parent stages of finalStage. If there is no missing parent stage, the stage is submitted for execution via submitMissingTasks; if parent stages exist, the stage is put into the waitingStages list and submitStage is called recursively on the parents. In this way, stages that still have unfinished parent stages wait in waitingStages, while stages without parent stages act as the entry points of the job and are executed first.
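The article does not quote submitStage itself, so here is a simplified, self-contained sketch of the recursion it describes. The ToyStage and ToySubmit names and the bookkeeping sets are my own illustrative stand-ins, not Spark's actual code:

import scala.collection.mutable

// Toy stand-ins for Stage and the DAGScheduler bookkeeping described above.
case class ToyStage(id: Int, parents: Seq[ToyStage])

object ToySubmit {
  val waitingStages  = mutable.Set[ToyStage]()
  val finishedStages = mutable.Set[Int]()

  // Mirrors the described logic: a stage is submitted only when all of its parents
  // are finished; otherwise it is parked in waitingStages and the missing parents
  // are submitted first, recursively.
  def submitStage(stage: ToyStage): Unit = {
    val missingParents = stage.parents.filterNot(p => finishedStages.contains(p.id))
    if (missingParents.isEmpty) {
      submitMissingTasks(stage)            // no missing parent: run this stage now
    } else {
      missingParents.foreach(submitStage)  // recurse into the missing parents first
      waitingStages += stage               // this stage waits until its parents finish
    }
  }

  def submitMissingTasks(stage: ToyStage): Unit =
    println(s"submitting tasks for stage ${stage.id}")
}

// For WordCount: ShuffleMapStage 0 has no parent, ResultStage 1 depends on it, so
// ToySubmit.submitStage(ToyStage(1, Seq(ToyStage(0, Nil)))) submits stage 0 first and parks stage 1.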

Submitting Tasks

Find the earliest stage that has not yet completed and submit its tasks. The function called is submitMissingTasks(stage, jobId.get).

private def submitMissingTasks(stage: Stage, jobId: Int) {
 logDebug("submitMissingTasks(" + stage + ")")
 // Get our pending tasks and remember them in our pendingTasks entry
 stage.pendingPartitions.clear()
 // First figure out the indexes of partition ids to compute.
 val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
 // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
 // with this Stage
 val properties = jobIdToActiveJob(jobId).properties
 runningStages += stage
 // SparkListenerStageSubmitted should be posted before testing whether tasks are
 // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
 // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
 // event.
 stage match {
  case s: ShuffleMapStage =>
   outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
  case s: ResultStage =>
   outputCommitCoordinator.stageStart(
    stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
 }
 val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
  stage match {
   case s: ShuffleMapStage =>
    partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
   case s: ResultStage =>
    val job = s.activeJob.get
    partitionsToCompute.map { id =>
     val p = s.partitions(id)
     (id, getPreferredLocs(stage.rdd, p))
    }.toMap
  }
 } catch {
  case NonFatal(e) =>
   stage.makeNewStageAttempt(partitionsToCompute.size)
   listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
   abortStage(stage, s"Task creation failed: **$**e\n**$**{Utils.exceptionString(e)}", Some(e))
   runningStages -= stage
   return
 }
 stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
 listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
 // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
 // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
 // the serialized copy of the RDD and for each task we will deserialize it, which means each
 // task gets a different copy of the RDD. This provides stronger isolation between tasks that
 // might modify state of objects referenced in their closures. This is necessary in Hadoop
 // where the JobConf/Configuration object is not thread-safe.
 var taskBinary: Broadcast[Array[Byte]] = null
 try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] = stage match {
   case stage: ShuffleMapStage =>
    JavaUtils.bufferToArray(
     closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
   case stage: ResultStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
  }
  taskBinary = sc.broadcast(taskBinaryBytes)
 } catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
   abortStage(stage, "Task not serializable: " + e.toString, *Some*(e))
   runningStages -= stage
   // Abort execution
   return
  case NonFatal(e) =>
   abortStage(stage, s"Task serialization failed: **$**e\n**$**{Utils.exceptionString(e)}", Some(e))
   runningStages -= stage
   return
 }
 val tasks: Seq[Task[_]] = try {
  stage match {
   case stage: ShuffleMapStage =>
    partitionsToCompute.map { id =>
     val locs = taskIdToLocations(id)
     val part = stage.rdd.partitions(id)
     new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
      taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
    }
   case stage: ResultStage =>
    val job = stage.activeJob.get
    partitionsToCompute.map { id =>
     val p: Int = stage.partitions(id)
     val part = stage.rdd.partitions(p)
     val locs = taskIdToLocations(id)
     new ResultTask(stage.id, stage.latestInfo.attemptId,
      taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
    }
  }
 } catch {
  case NonFatal(e) =>
   abortStage(stage, s"Task creation failed: **$**e\n**$**{Utils.exceptionString(e)}", Some(e))
   runningStages -= stage
   return
 }
 if (tasks.size > 0) {
  logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
  stage.pendingPartitions ++= tasks.map(_.partitionId)
  logDebug("New pending partitions: " + stage.pendingPartitions)
  taskScheduler.submitTasks(new TaskSet(
   tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
  stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
 } else {
  // Because we posted SparkListenerStageSubmitted earlier, we should mark
  // the stage as completed here in case there are no tasks to run
   // ... (omitted)
 }
}

Stages in the WordCount Example

As mentioned earlier, WordCount has only one job, and reduceByKey is the shuffle operation that forms the stage boundary. The stage before the shuffle is a ShuffleMapStage and the stage after it is a ResultStage: the earlier stage performs the shuffle (map-side) work, while the later stage computes the final result of the job, which is why it is called a ResultStage.

A ResultStage applies a function to some partitions of its RDD to compute the result of the action. Not every action runs on every partition, though; first(), for example, does not.
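A quick illustrative check (a small snippet I am adding, reusing the sc from the WordCount program): first() is implemented via take(1), which initially launches a job on just one partition, so the Web UI shows a 1-task job even though the RDD has more partitions.

// first() only needs data from one partition, so Spark does not scan all of them up front.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.first())   // check the Web UI: this job runs 1 task, not 8
println(rdd.count())   // by contrast, count() runs one task per partition (8 tasks here)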

Next, let's walk through the execution flow of this function.

  • First, compute partitionsToCompute, i.e. the partitions that still need to be computed.
  • Then call outputCommitCoordinator.stageStart. The OutputCommitCoordinator coordinates committing output (for example to HDFS); its two parameters are the stageId and the maximum partition id this stage will compute.
  • Next, compute the locations of the tasks of this stage. Since task ids correspond to partition ids, this amounts to computing the preferred TaskLocation of each partition; a TaskLocation is a host or a (host, executorId) pair.
  • stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq) creates a new attempt, which records how many times this stage has been executed. A stage may fail and be retried, and the attempt number starts at 0.
  • Then a broadcast variable is created and broadcast. The broadcast is what the executors use to deserialize the tasks. The task binary is serialized first, and each task deserializes its own copy of the RDD, which gives tasks stronger isolation; this is necessary for objects that are not thread-safe. For a ShuffleMapTask the serialized data is (rdd, shuffleDep); for a ResultTask it is (rdd, func).
  • Then the tasks are created. Tasks are either ShuffleMapTasks or ResultTasks, matching the stage type. Creating them uses stage.latestInfo.attemptId, the attempt mentioned above.
  • After the tasks are created, taskScheduler.submitTasks() hands them over to the TaskScheduler for scheduling, as shown in the code below.
override def submitTasks(taskSet: TaskSet) {
 val tasks = taskSet.tasks
 logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
 this.synchronized {
  val manager = createTaskSetManager(taskSet, maxTaskFailures)
  val stage = taskSet.stageId
  val stageTaskSets =
   taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
  stageTaskSets(taskSet.stageAttemptId) = manager
  val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
   ts.taskSet != taskSet && !ts.isZombie
  }
  if (conflictingTaskSet) {
   throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
    s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
  }
  schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
  if (!isLocal && !hasReceivedTask) {
   starvationTimer.scheduleAtFixedRate(new TimerTask() {
    override def run() {
     if (!hasLaunchedTask) {
      logWarning("Initial job has not accepted any resources; " +
       "check your cluster UI to ensure that workers are registered " +
       "and have sufficient resources")
     } else {
      this.cancel()
     }
    }
   }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
  }
  hasReceivedTask = true
 }
 backend.reviveOffers()
}

The first part of this code creates a TaskSetManager and then checks whether another active (non-zombie) TaskSet already exists for the same stage; if there is a conflict, an IllegalStateException is thrown.

The TaskSetManager is then added to schedulableBuilder. This variable is initialized with a scheduling policy such as FIFO or FAIR, and once the TaskSetManager is added, it is scheduled according to that policy.
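The policy is chosen via the spark.scheduler.mode configuration (FIFO by default). For example, to use fair scheduling you could configure the SparkConf of the WordCount program like this (illustrative only):

// Select the scheduling policy used by schedulableBuilder (FIFO is the default).
val conf = new SparkConf()
  .setAppName("wc")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")   // or "FIFO"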

The next checks are whether we are in local mode (isLocal) and whether any task has been received before. If this is not local mode and no task has been accepted yet, a TimerTask is scheduled that periodically checks whether any task has been launched and warns if the job still has not accepted any resources, so the application does not just spin idly with nothing being scheduled.

Finally, the backend is asked to revive resource offers via backend.reviveOffers(). The backend here is usually CoarseGrainedSchedulerBackend; reviveOffers makes the driverEndpoint send a ReviveOffers message, which the backend's receive function picks up and acts on. Let's look at CoarseGrainedSchedulerBackend's receive function.

override def receive: PartialFunction[Any, Unit] = {
 ...
 case ReviveOffers =>
  makeOffers()
  ...
}
private def makeOffers() {
 // Filter out executors under killing
 val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
 val workOffers = activeExecutors.map { case (id, executorData) =>
  new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
 }.toSeq
 launchTasks(scheduler.resourceOffers(workOffers))
}

The code above filters for the executors that are still alive and creates WorkerOffers from them, with the parameters executorId, host, and freeCores.

Executing Tasks

Then launchTasks is called:

private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
 for (task <- tasks.flatten) {
  val serializedTask = ser.serialize(task)
  if (serializedTask.limit >= maxRpcMessageSize) {
   scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
    try {
     var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
      "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
      "spark.rpc.message.maxSize or using broadcast variables for large values."
     msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
     taskSetMgr.abort(msg)
    } catch {
     case e: Exception => logError("Exception in error callback", e)
    }
   }
  }
  else {
   val executorData = executorDataMap(task.executorId)
   executorData.freeCores -= scheduler.CPUS_PER_TASK
   logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
    s"${executorData.executorHost}.")
   executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
  }
 }
}

The code above serializes each task, looks up the executor assigned to it by task.executorId, and then calls executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask))).

There is an executorEndpoint here, and earlier there was a driverEndpoint (in backend.reviveOffers); both are of type RpcEndpointRef. An RpcEndpointRef is a remote reference to an RpcEndpoint and is thread-safe.

An RpcEndpoint, in Spark's RPC (Remote Procedure Call) framework, defines which method is triggered by each received message.

It also defines a clear lifecycle: constructor -> onStart -> receive -> onStop.

Here, "receive" refers to both receive and receiveAndReply.

The difference between them:

receive handles messages that need no reply (sent with send), while receiveAndReply handles messages whose sender waits for a reply (sent with ask). (Reference: http://www.07net01.com/2016/04/1434116.html)
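Spark's rpc package is internal, so here is a small self-contained sketch of the same pattern with toy types of my own (ToyEndpoint, ReviveOffers, GetStatus), not Spark's RpcEndpoint API: a fire-and-forget message handled by receive, versus an ask-style message handled by receiveAndReply that produces a reply for the caller.

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Toy messages mimicking the receive / receiveAndReply split.
case object ReviveOffers                          // fire-and-forget, like send(...)
case class GetStatus(reply: Promise[String])      // ask-style, the caller wants an answer

class ToyEndpoint {
  // Like receive: handle the message, produce no reply.
  def receive: PartialFunction[Any, Unit] = {
    case ReviveOffers => println("making offers")
  }
  // Like receiveAndReply: the handler completes the caller's reply.
  def receiveAndReply: PartialFunction[Any, Unit] = {
    case GetStatus(reply) => reply.success("ALIVE")
  }
}

object ToyRpcDemo extends App {
  val endpoint = new ToyEndpoint
  endpoint.receive(ReviveOffers)                  // "send": nothing comes back

  val p = Promise[String]()
  endpoint.receiveAndReply(GetStatus(p))          // "ask": the caller waits on the reply
  println(Await.result(p.future, Duration.Inf))   // prints "ALIVE"
}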

So a message sent to the driverEndpoint goes to CoarseGrainedSchedulerBackend, while a message sent to the executorEndpoint goes to CoarseGrainedExecutorBackend. Let's look at the corresponding receive code.

override def receive: PartialFunction[Any, Unit] = {
 ...
 case LaunchTask(data) =>
  if (executor == null) {
   exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
   val taskDesc = ser.deserialize[TaskDescription](data.value)
   logInfo("Got assigned task " + taskDesc.taskId)
   executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
    taskDesc.name, taskDesc.serializedTask)
  }
  ...
}

Here the incoming data is first deserialized into a TaskDescription, and then executor.launchTask is called.

def launchTask(
        context: ExecutorBackend,
        taskId: Long,
        attemptNumber: Int,
        taskName: String,
        serializedTask: ByteBuffer): Unit = {
 val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
  serializedTask)
 runningTasks.put(taskId, tr)
 threadPool.execute(tr)
}

A TaskRunner is created here and handed to the thread pool; since the thread pool runs the TaskRunner, what actually executes is TaskRunner's run method. That method is quite long, so instead of quoting it, here is a rough description.

It mainly sets up parameters and properties, deserializes the task, and so on, and then calls task.runTask. The task here may be either a ShuffleMapTask or a ResultTask, so let's look at the runTask methods of both.
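Since the original run method is not quoted, here is a heavily simplified, self-contained sketch of its overall flow using toy types of my own (ToyTask, ToyTaskRunner). It is illustrative only; the real TaskRunner also handles memory management, metrics, kill/interruption, and status updates via the ExecutorBackend.

// A toy outline of what TaskRunner.run does conceptually (not Spark's actual code).
trait ToyTask[T] { def runTask(): T }

class ToyTaskRunner[T](taskId: Long,
                       serializedTask: Array[Byte],
                       deserialize: Array[Byte] => ToyTask[T],
                       reportResult: (Long, T) => Unit) extends Runnable {
  override def run(): Unit = {
    try {
      // 1. Set up thread-local context and properties, then deserialize the task
      val task = deserialize(serializedTask)
      // 2. Run the task body (ShuffleMapTask.runTask or ResultTask.runTask)
      val result = task.runTask()
      // 3. Serialize the result and report SUCCESS back to the driver via the backend
      reportResult(taskId, result)
    } catch {
      // 4. On failure, a FAILED status with the exception would be reported instead
      case e: Throwable => println(s"task $taskId failed: $e")
    }
  }
}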

ShuffleMapTask

Let's look at ShuffleMapTask first.

override def runTask(context: TaskContext): MapStatus = {
 // Deserialize the RDD using the broadcast variable.
 val deserializeStartTime = System.currentTimeMillis()
 val ser = SparkEnv.get.closureSerializer.newInstance()
 val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
 _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
 var writer: ShuffleWriter[Any, Any] = null
 try {
  val manager = SparkEnv.get.shuffleManager
  writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  writer.stop(success = true).get
 } catch {
  case e: Exception =>
   try {
    if (writer != null) {
     writer.stop(success = false)
    }
   } catch {
    case e: Exception =>
     log.debug("Could not stop writer", e)
   }
   throw e
 }
}

The first part of the code is just the deserialization; the interesting part is in the middle: it obtains the shuffleManager and then a writer via getWriter. Because a ShuffleMapTask sits on the map side of a shuffle, it has to perform the shuffle write.

ResultTask

Now let's look at ResultTask's runTask.

override def runTask(context: TaskContext): U = {
 // Deserialize the RDD and the func using the broadcast variables.
 val deserializeStartTime = System.currentTimeMillis()
 val ser = SparkEnv.get.closureSerializer.newInstance()
 val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
 _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
 func(context, rdd.iterator(partition, context))
}

Summary

(1) Create the Spark entry point, the SparkContext.

(2) Read the file and load its contents into an RDD.

(3) Distribute the work across the worker nodes.

(4) Each node processes its share of the data: first split each line into words on spaces, producing a flatMapped MapPartitionsRDD.

(5) Map each word into a key-value pair, outputting (word, 1).

(6) Aggregate the words locally on each node (the map-side combine of reduceByKey), producing partial word counts.

(7) Shuffle data between the nodes and aggregate the counts for each word across nodes, producing the final RDD with the global word counts.

(8) Output the result.

Question

Q1: The last stage is the first one submitted (submitStage starts from the finalStage), yet it is the last one to finish executing. Why?

Q2: What does the shuffle do in the map phase and in the reduce phase, respectively?
