Spark-Core Source Code Study Notes 5: Task Launch and Series Recap

Spark-Core Source Code Study Notes

This series records a review of the Spark source code. The goal is to lay out the mechanism and flow by which Spark distributes and runs a program, tracing the key pieces of source code so that we understand not only what happens but why; side branches are only described in words and not drilled into, to keep the main thread clear.
At the end of the previous article we arrived at the Executor's launchTask method. In this article we step into that method, finish tracing the task launch flow, and look at how the final result is handled.

TaskRunner

Picking up where the last article left off, we go straight into the construction of TaskRunner and its overridden run method:

def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  // Wrap the TaskDescription and the ExecutorBackend (CoarseGrainedExecutorBackend by default)
  // into a TaskRunner, which implements Runnable
  val tr = new TaskRunner(context, taskDescription)
  // Register it in runningTasks, the ConcurrentHashMap tracking all tasks running on this executor
  runningTasks.put(taskDescription.taskId, tr)
  // Hand the TaskRunner to the thread pool for execution
  threadPool.execute(tr)
}
class TaskRunner(
    execBackend: ExecutorBackend,
    private val taskDescription: TaskDescription)
  extends Runnable {
  val taskId = taskDescription.taskId
  val threadName = s"Executor task launch worker for task $taskId"
  private val taskName = taskDescription.name
  /** Whether this task has been finished. */
  @GuardedBy("TaskRunner.this")
  private var finished = false

  override def run(): Unit = {
    ... // run() starts with a fair amount of setup code, omitted here; only the main path is shown
    // Report RUNNING to the ExecutorBackend (CoarseGrainedExecutorBackend by default)
    execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
    try {
      // Deserialize the task itself
      task = ser.deserialize[Task[Any]](
        taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
      // Run the actual task and measure its runtime.
      // Utils.tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T
      // is Scala sugar for a try/finally whose finally block is exception-safe
      val value = Utils.tryWithSafeFinally {
        // Call task.run(); this is where the task actually starts executing
        val res = task.run(
          taskAttemptId = taskId,
          attemptNumber = taskDescription.attemptNumber,
          metricsSystem = env.metricsSystem)
        threwException = false
        res
      } {
        // Effectively a wrapped finally block: releases the task's locks and memory
      }
      // Serialize the value returned by the task
      val valueBytes = resultSer.serialize(value)
      ...
      // Note: accumulator updates must be collected after TaskMetrics is updated
      // Collect the accumulator updates from the task
      val accumUpdates = task.collectAccumulatorUpdates()
      // Wrap the result bytes and the accumulator updates together
      val directResult = new DirectTaskResult(valueBytes, accumUpdates)
      val serializedDirectResult = ser.serialize(directResult)
      val resultSize = serializedDirectResult.limit()

      // directSend = sending directly back to the driver
      val serializedResult: ByteBuffer = {
        // maxResultSize defaults to 1g and can be changed via spark.driver.maxResultSize
        if (maxResultSize > 0 && resultSize > maxResultSize) {
          /** IndirectTaskResult: A reference to a DirectTaskResult that has been stored in the worker's BlockManager. */
          // Over the limit: the result is dropped and only a reference is sent back
          ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
        } else if (resultSize > maxDirectResultSize) {
          // Larger than the direct-result limit: store it in the BlockManager and send back a reference
          val blockId = TaskResultBlockId(taskId)
          env.blockManager.putBytes(
            blockId,
            new ChunkedByteBuffer(serializedDirectResult.duplicate()),
            StorageLevel.MEMORY_AND_DISK_SER)
          ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
        } else {
          // Small enough: send the serialized DirectTaskResult straight back
          serializedDirectResult
        }
      }
      /* Set the finished flag to true and clear the current thread's interrupt status */
      setTaskFinishedAndClearInterruptStatus()
      // Call statusUpdate again, this time reporting FINISHED along with the serialized result
      execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
    } catch {
      ... // exception handling (reporting a FAILED state back to the driver) omitted
    } finally {
      runningTasks.remove(taskId)
    }
  }
}
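For reference, the two thresholds driving the three-way branch above come from Spark configuration: maxResultSize corresponds to spark.driver.maxResultSize (1g by default, 0 disables the check), and maxDirectResultSize is derived from spark.task.maxDirectResultSize capped by the RPC message size limit. Below is a minimal, illustrative sketch of setting these keys; the values are examples, so verify the defaults against your Spark version.

import org.apache.spark.SparkConf

// Illustrative sketch only: the keys are real Spark settings, the values are examples.
object ResultSizeConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Cap on the total serialized result size the driver will accept (0 disables the check).
      // Results above this are dropped; only an IndirectTaskResult reference is sent back.
      .set("spark.driver.maxResultSize", "1g")
      // Results above this size (but under maxResultSize) are written to the executor's
      // BlockManager, and the driver fetches them via the IndirectTaskResult reference.
      .set("spark.task.maxDirectResultSize", "1m")
    println(conf.get("spark.driver.maxResultSize"))
  }
}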

We only need to focus on the task.run() and statusUpdate methods. Let's look at statusUpdate first:

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    // In effect this just sends the wrapped StatusUpdate message to the driver
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
// On the driver side, in CoarseGrainedSchedulerBackend (its DriverEndpoint)
override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    // This calls TaskSchedulerImpl's statusUpdate method.
    // It updates the task-state bookkeeping; internally a TaskResultGetter processes
    // the returned computation result (not expanded further here)
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // Give the freed CPU cores back to the executor's bookkeeping
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          // The finished task has released its resources, so this executor
          // can be offered for scheduling again
          makeOffers(executorId)
        case None =>
          // Ignoring the update since we don't know about the executor.
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }
}
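The comment above notes that TaskSchedulerImpl hands the serialized result to a TaskResultGetter without expanding on it. As a rough, self-contained sketch of the pattern that component follows (the classes and names below are simplified stand-ins, not Spark's actual ones): a DirectTaskResult already carries the value bytes, while an IndirectTaskResult only carries a block id that must first be fetched from the executor's BlockManager.

import java.nio.ByteBuffer

// Simplified stand-ins for the driver-side result handling pattern (not Spark's classes).
object ResultHandlingSketch {
  sealed trait TaskResultSketch
  final case class DirectResult(valueBytes: ByteBuffer) extends TaskResultSketch
  final case class IndirectResult(blockId: String, size: Long) extends TaskResultSketch

  // Stand-in for fetching the stored bytes from the remote BlockManager.
  def fetchFromBlockManager(blockId: String): ByteBuffer =
    ByteBuffer.wrap(Array.fill[Byte](4)(0.toByte))

  // Small results arrive inline; large ones are fetched by reference, then deserialized.
  def resultBytes(result: TaskResultSketch): ByteBuffer = result match {
    case DirectResult(bytes)        => bytes
    case IndirectResult(blockId, _) => fetchFromBlockManager(blockId)
  }

  def main(args: Array[String]): Unit = {
    println(resultBytes(IndirectResult("taskresult_42", 2048L)).limit()) // 4
  }
}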

Next, the task.run() method:

  /**
   * Called by [[org.apache.spark.executor.Executor]] to run this task.
   * @return the result of the task along with updates of Accumulators.
   */
  final def run(
      taskAttemptId: Long,
      attemptNumber: Int,
      metricsSystem: MetricsSystem): T = {
    // Wrap up the task's context information
    val context = new TaskContextImpl(
      stageId,
      stageAttemptId, // stageAttemptId and stageAttemptNumber are semantically equal
      partitionId,
      taskAttemptId,
      attemptNumber,
      taskMemoryManager,
      localProperties,
      metricsSystem,
      metrics)
    try {
      // Final dispatch: runTask has a different implementation per task type
      runTask(context)
    } catch {...}
  }

Finally, the runTask method of the concrete Task subclass is invoked; ShuffleMapTask and ResultTask provide different implementations:

// ShuffleMapTask
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTimeNs = System.nanoTime()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // Deserialization yields the rdd and its ShuffleDependency
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs
  _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
  } else 0L

  dep.shuffleWriterProcessor.write(rdd, dep, partitionId, context, partition)
}
// ResultTask
override def runTask(context: TaskContext): U = {
  // ...
  // The preamble is the same as above; only the following two lines differ.
  // Here deserialization yields the rdd and the user function func instead of a dependency
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  func(context, rdd.iterator(partition, context))
}
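To make func concrete: for the count action used as the running example in this series, RDD.count is built on SparkContext.runJob with a function that simply counts each partition's iterator (Utils.getIteratorSize in the Spark source), and the driver sums the per-partition counts. A minimal stand-alone sketch of what the deserialized func boils down to:

// Stand-alone sketch of the kind of func a ResultTask deserializes for count().
// countIterator mirrors what Utils.getIteratorSize does in the Spark source.
object ResultTaskFuncSketch {
  def countIterator[T](iter: Iterator[T]): Long = {
    var count = 0L
    while (iter.hasNext) { iter.next(); count += 1 }
    count
  }

  def main(args: Array[String]): Unit = {
    // On an executor, runTask calls: func(context, rdd.iterator(partition, context)).
    // For count(), that func is essentially countIterator applied to the partition's iterator:
    val partitionData = Iterator(1, 2, 3, 4, 5)
    println(countIterator(partitionData)) // 5; the driver then sums the per-partition counts
  }
}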

Taking ShuffleMapTask as the example, look at the comments on the write method:

/**
   * The write process for particular partition, it controls the life circle of [[ShuffleWriter]]
   * get from [[ShuffleManager]] and triggers rdd compute, finally return the [[MapStatus]] for
   * this task.
   */
  def write(...): MapStatus = {
    var writer: ShuffleWriter[Any, Any] = null
    try {
      val manager = SparkEnv.get.shuffleManager
      writer = manager.getWriter[Any, Any](
        dep.shuffleHandle,
        partitionId,
        context,
        createMetricsReporter(context))
      /** Write a sequence of records to this task's output */
      writer.write(
        rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
    } catch {...}
  }
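The writer obtained from the ShuffleManager exposes a small contract: write the partition's records, then stop and hand back the MapStatus that becomes this task's result. Roughly (the Sketch-suffixed names below are stand-ins, so check the exact signatures in your Spark version):

// Rough shape of the ShuffleWriter contract used above (stand-in types, not Spark's).
abstract class ShuffleWriterSketch[K, V] {
  /** Write a sequence of records to this task's output. */
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close the writer; on success, return the MapStatus describing the map output. */
  def stop(success: Boolean): Option[MapStatusSketch]
}

// Stand-in for MapStatus: where the map output lives and the (compressed) size of each
// reduce partition's block, which the reducers use later to fetch their shuffle data.
final case class MapStatusSketch(location: String, sizesByReduceId: Array[Long])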

Recap of the Core Flow

To close out the series, here is a recap of the flow from stage splitting to task execution (a minimal driver program illustrating the whole chain follows the list):

1. An action operator triggers the job; taking count as the example, it internally calls dagScheduler's runJob method.
2. runJob calls submitJob, which instantiates and returns a JobWaiter and posts a JobSubmitted event message onto the event loop queue.
3. When eventProcessLoop receives the message, it calls handleJobSubmitted and stage splitting begins.

4. An attempt is made to create the ResultStage directly from the final RDD.
4.1 Based on the RDD's wide (shuffle) dependencies, the trailing stage is carved out, in preparation for returning an instantiated ResultStage.
4.2 Instantiation starts from the last RDD and keeps walking the dependencies of its parent RDDs, checking whether they were persisted before (cache, materialization, or checkpoint), until a persisted RDD or the first RDD is reached; stages are then created from front to back until the ResultStage is produced.
4.3 How wide vs. narrow dependencies are decided: inside each operator, the current RDD's partitioner is compared with the parent RDD's partitioner; if they are equal the transformation becomes a MapPartitionsRDD and no shuffle is produced, otherwise a ShuffledRDD is generated. This information is recorded inside each RDD, including the data location of every partition, and is used later when tasks are carved out.
4.4 With finalStage in hand, after updating and wrapping a few attributes, we enter the job-submission entry point submitStage(finalStage).
4.5 Internally, all parent stages must be instantiated first, so getMissingAncestorShuffleDependencies loops over and instantiates the ShuffleMapStages; in this process stages are split from back to front and then instantiated from front to back.

5. Once stage instantiation is finished, we enter the task-submission part: submitMissingTasks(stage: Stage, jobId: Int).
5.1 The task best-location algorithm is invoked with the partitionId and this stage's RDD:
a. check whether the RDD was persisted in memory, on disk, or off-heap; b. check whether it was checkpointed; c. look it up in the BlockManager; d. walk the RDD's narrow dependencies, recursively calling getPreferredLocsInternal for each partition, i.e. starting from the first partition of the first narrow dependency, appending each partition's best location to a sequence, and finally returning the sequence of best locations for all partitions.
5.2 ShuffleMapTasks or ResultTasks are generated.
5.3 The tasks are wrapped into a TaskSet and handed to the taskScheduler for submission to the executors.
5.4 A TaskSetManager is created and added to the schedulableQueue of the scheduling pool, and eventually CoarseGrainedSchedulerBackend.reviveOffers() is called.
5.5 This triggers the ReviveOffers message on the DriverEndpoint, which internally calls makeOffers().
5.6 The metadata of each filtered executor is wrapped into a WorkerOffer and handed to TaskSchedulerImpl.
5.7 scheduler.resourceOffers assigns an executor to each task according to its locality level and returns the wrapped TaskDescriptions.
5.8 Back in CoarseGrainedSchedulerBackend, the TaskDescriptions are dispatched to each executor.
5.9 The driver sends a LaunchTask event message through each executorEndpoint reference, with the serialized TaskDescription wrapped inside.
5.10 The corresponding executor receives and matches the message in receive, wraps the TaskDescription into a Runnable (TaskRunner), and runs it on the thread pool, which internally calls task.run.
5.11 task.run in turn calls runTask(context), for which ShuffleMapTask and ResultTask have different implementations.
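Tying the recap together, here is a minimal driver program (local mode, illustrative names and values) that exercises the whole chain: reduceByKey introduces a shuffle, so the job is split into a ShuffleMapStage running ShuffleMapTasks and a ResultStage running ResultTasks, and count() is the action that kicks off dagScheduler.runJob.

import org.apache.spark.{SparkConf, SparkContext}

object RecapExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("recap-example").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val distinctKeys = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2)
      .reduceByKey(_ + _) // wide dependency -> ShuffledRDD -> stage boundary
      .count()            // action -> JobSubmitted -> stages -> TaskSets -> TaskRunner -> task.run

    println(distinctKeys) // 2
    sc.stop()
  }
}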

References:

Apache Spark 2.3 source code
