Spark Source Code Analysis, Part 3: Task Splitting and Execution

Latest recommended article published on 2023-06-27 15:59:10

comeOnBaby126

Category: spark, source-code sequence diagrams. Tags: spark, scala, big data

Article link: https://blog.csdn.net/comeOnBaby126/article/details/113955227

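Spark's execution flow consists of two steps: preparing the resource environment, and submitting and running tasks. The two steps interleave; this article follows task submission as the main thread of the source code analysis.
For the resource preparation side, see "Spark Source Code Analysis, Part 2: Compute Resource Preparation".
For the preliminary setup of the source code, see "Spark Source Code Analysis, Part 1: Overview".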

I. Spark Task Submission Sequence Diagram

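This trace follows the source code in yarn-cluster mode. Only cluster mode is considered, and unless stated otherwise, yarn-cluster mode is assumed.
Because the Spark source code is complex, and to make later review easier, the source code path of task submission during Spark application startup is drawn as a sequence diagram in chronological order. The diagram lists only the important nodes and omits classes and objects unrelated to the topic. It does not strictly follow the sequence diagram standard; if you spot a problem, please point it out promptly so we can learn from each other.
[Figure: sequence diagram of Spark task submission]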

II. Source Code Analysis

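The order of the source code below largely follows the sequence diagram. A first-level heading identifies a class (the fully qualified class name is not shown), and a second-level heading identifies a method of that class. The code is presented in the order of the logical call chain rather than in heading order, so second-level headings appear out of order across the article, although within each first-level heading they are numbered in order.
Strictly speaking, the analysis should start from SparkSubmit rather than WordCount. Since the previous article already covered the resource preparation source code, this one starts from WordCount.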

1 WordCount

1.1 WordCount.main

Only an action operator triggers job submission, so the analysis starts from the collect method.

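import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // collect() is an action, so it is the call that triggers job submission
    sc.makeRDD(List(1, 2)).collect()
    sc.stop()
  }
}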

main calls the collect method of RDD.
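
To make the "only an action triggers work" point concrete, here is a minimal sketch in plain Scala with no Spark dependency. `MiniRDD` is a hypothetical stand-in invented for this illustration, not Spark's RDD: `map` only records the computation, and `collect` is the only place where it actually runs.

```scala
object LazyActionSketch {
  // Hypothetical stand-in for an RDD: it only records how to produce its data.
  final class MiniRDD[T](compute: () => Iterator[T]) {
    // "Transformation": builds a new MiniRDD and runs nothing.
    def map[U](f: T => U): MiniRDD[U] = new MiniRDD(() => compute().map(f))
    // "Action": the only place where the recorded computation is executed.
    def collect(): List[T] = compute().toList
  }

  // Returns (calls to f before the action, collected result, calls to f after the action).
  def demo(): (Int, List[Int], Int) = {
    var evaluated = 0
    val source = new MiniRDD(() => Iterator(1, 2))
    val mapped = source.map { x =>
      evaluated += 1
      x * 10
    }
    val before = evaluated // nothing has run yet
    val result = mapped.collect()
    (before, result, evaluated)
  }
}
```

Calling `LazyActionSketch.demo()` shows that the mapping function has not been called at all before `collect()` and has been called once per element afterwards.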

2 RDD

2.1 RDD.collect

  def collect(): Array[T] = withScope {
    // Note that `this` is the final RDD on which the action was called;
    // this RDD is passed down unchanged until the job is split into tasks
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

collect calls the runJob method of SparkContext.
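
The shape of collect can be sketched without Spark: one array is produced per partition, and the arrays are concatenated in partition order. The nested `Seq` layout below is an assumption made for this illustration; it stands in for the partitions of an RDD.

```scala
object CollectSketch {
  // Each inner Seq stands for the data of one partition (hypothetical layout).
  def collect(partitions: Seq[Seq[Int]]): Array[Int] = {
    // One result per partition, like the Array returned by sc.runJob in collect.
    val results: Array[Array[Int]] = partitions.map(p => p.iterator.toArray).toArray
    // Flatten the per-partition arrays into a single array, in partition order.
    Array.concat(results: _*)
  }
}
```

For example, `CollectSketch.collect(Seq(Seq(1, 2), Seq.empty[Int], Seq(3)))` yields the elements 1, 2, 3; an empty partition simply contributes an empty array.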

3 SparkContext

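SparkContext is the Spark context. Its initialization was analyzed in the previous article and is not repeated here, although the parts related to task submission come up again below.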

3.1 SparkContext.runJob

def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    // Note the second argument, partitions: it covers all partitions of the current RDD
    // and is passed all the way down until the stage is created; the number of partitions
    // of the final stage is this number, as the source code further below shows
    runJob(rdd, func, 0 until rdd.partitions.length)
}
....
 def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }
....
 def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }
....
 def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    ....
    // Hand the job over to the DAGScheduler to run
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    ....
  }

At this point the DAGScheduler starts working; let's look at DAGScheduler.runJob.
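
The chain of runJob overloads can be mimicked in plain Scala to show how the results array is filled through resultHandler. This is only a sketch: the three method names and the in-memory `data` parameter are hypothetical (the real overloads all share the name runJob and take an RDD), and the function is applied locally instead of being scheduled.

```scala
import scala.reflect.ClassTag

object RunJobSketch {
  // Innermost variant: applies func to each requested partition and reports
  // each result through resultHandler(outputIndex, result).
  def runJobWithHandler[T, U](
      data: IndexedSeq[Seq[T]],
      func: Iterator[T] => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit =
    partitions.zipWithIndex.foreach { case (partitionId, outputIndex) =>
      resultHandler(outputIndex, func(data(partitionId).iterator))
    }

  // Middle variant: allocates the results array and lets the handler fill it by index.
  def runJobOnPartitions[T, U: ClassTag](
      data: IndexedSeq[Seq[T]],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJobWithHandler[T, U](data, func, partitions, (index: Int, res: U) => results(index) = res)
    results
  }

  // Outermost variant: runs on every partition, like `0 until rdd.partitions.length`.
  def runJobAll[T, U: ClassTag](data: IndexedSeq[Seq[T]], func: Iterator[T] => U): Array[U] =
    runJobOnPartitions(data, func, 0 until data.length)
}
```

Because the handler writes into a slot chosen by index, results can arrive in any order and still land in the right position, which is why the outer overloads can simply return the array once the job is done.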

4 DAGScheduler

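DAGScheduler, the directed acyclic graph scheduler, is also known as the stage splitter or the high-level scheduler. Its main job is to split a submitted job into stages, build the DAG of those stages, and submit the stages for execution.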
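
As a rough illustration of stage splitting, the sketch below walks a toy lineage graph backwards from the final node and starts a new stage at every shuffle dependency, which is where Spark cuts stage boundaries. `Node`, `Narrow`, `Shuffle` and `stages` are hypothetical names for this illustration only, and the walk is simplified: it does not deduplicate a parent that is reachable along several paths.

```scala
object StageSplitSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep])

  // Returns the stages as lists of node names, parent stages first, final stage last.
  def stages(finalNode: Node): List[List[String]] = {
    // Collects the nodes that belong to the same stage as `node` (reachable through
    // narrow dependencies) and the nodes that start a parent stage (behind a shuffle).
    def walk(node: Node): (List[String], List[Node]) =
      node.deps.foldLeft((List(node.name), List.empty[Node])) {
        case ((names, parents), Narrow(p)) =>
          val (moreNames, moreParents) = walk(p)
          (names ++ moreNames, parents ++ moreParents)
        case ((names, parents), Shuffle(p)) =>
          (names, parents :+ p)
      }

    val (names, parents) = walk(finalNode)
    parents.flatMap(stages) :+ names
  }
}
```

With a lineage such as `Node("reduceByKey", List(Shuffle(Node("map", List(Narrow(Node("textFile", Nil)))))))`, the sketch produces two stages: one containing map and textFile, and a final one containing reduceByKey.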

4.1 DAGScheduler.runJob

  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
   
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    ....
  }
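
The waiter returned by submitJob lets the calling thread block until every task has reported its result. A simplified sketch of that idea follows; `MiniJobWaiter` is a hypothetical class written for this illustration, built on a standard `Promise`, and plain threads stand in for tasks finishing on executors.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object JobWaiterSketch {
  // Hypothetical, simplified waiter: completes once every task has reported a result.
  final class MiniJobWaiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
    private val finishedTasks = new AtomicInteger(0)
    private val promise = Promise[Unit]()
    // A job with no tasks is finished immediately.
    if (totalTasks == 0) promise.success(())

    def taskSucceeded(index: Int, result: U): Unit = {
      // Serialize handler calls, since tasks may finish on different threads.
      resultHandler.synchronized { resultHandler(index, result) }
      if (finishedTasks.incrementAndGet() == totalTasks) promise.success(())
    }

    def jobFailed(e: Exception): Unit = {
      promise.tryFailure(e)
      ()
    }

    def completionFuture: Future[Unit] = promise.future
  }

  def demo(): List[Int] = {
    val results = new Array[Int](3)
    val waiter = new MiniJobWaiter[Int](3, (i, r) => results(i) = r)
    // Three threads stand in for three tasks finishing on executors.
    val workers = (0 until 3).map(i => new Thread(() => waiter.taskSucceeded(i, i * i)))
    workers.foreach(_.start())
    // The submitting thread blocks here until the last task has reported.
    Await.result(waiter.completionFuture, 5.seconds)
    results.toList
  }
}
```

The submitting side only needs the future: it does not care which thread finishes last, only that the counter has reached the total number of tasks.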

Next, look at the DAGScheduler.submitJob method.

4.2 DAGScheduler.submitJob

// eventProcessLoop is an internal field of DAGScheduler; DAGSchedulerEventProcessLoop is mainly used for internal messaging
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
....
def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
   
    ....
    val jobId = nextJobId.getAndIncrement()
    ....
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    // Post a JobSubmitted message through eventProcessLoop; it is received in DAGSchedulerEventProcessLoop
    // (search for JobSubmitted in DAGSchedulerEventProcessLoop)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }

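eventProcessLoop.post actually puts a message of type JobSubmitted into eventQueue, the message queue of EventLoop, so the next step is to find the matching case in the doOnReceive method of DAGSchedulerEventProcessLoop.
For why the logic works this way, see Section 16, the last section at the end of this article.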
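
The post-then-receive pattern can be sketched as a queue plus a single consumer thread. `MiniEventLoop` below is a hypothetical, stripped-down class for illustration, not Spark's EventLoop: `post` only enqueues the event and returns, while a background thread takes events off the queue one at a time and hands them to `onReceive`.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, LinkedBlockingDeque, TimeUnit}

object EventLoopSketch {
  // Hypothetical, stripped-down event loop: one queue, one consumer thread.
  abstract class MiniEventLoop[E](loopName: String) {
    private val eventQueue = new LinkedBlockingDeque[E]()
    @volatile private var stopped = false

    private val eventThread = new Thread(loopName) {
      setDaemon(true)
      override def run(): Unit =
        try {
          // take() blocks until an event is available.
          while (!stopped) onReceive(eventQueue.take())
        } catch {
          case _: InterruptedException => () // stop() interrupts the blocking take()
        }
    }

    def start(): Unit = eventThread.start()
    def stop(): Unit = {
      stopped = true
      eventThread.interrupt()
    }

    // Producer side: enqueue the event and return immediately.
    def post(event: E): Unit = eventQueue.put(event)

    // Consumer side: called on the event thread, one event at a time.
    protected def onReceive(event: E): Unit
  }

  def demo(): List[String] = {
    val seen = new ConcurrentLinkedQueue[String]()
    val done = new CountDownLatch(2)
    val loop = new MiniEventLoop[String]("mini-event-loop") {
      override protected def onReceive(event: String): Unit = {
        seen.add(event)
        done.countDown()
      }
    }
    loop.start()
    loop.post("JobSubmitted(0)")
    loop.post("JobSubmitted(1)")
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    seen.toArray(new Array[String](0)).toList
  }
}
```

A single consumer thread means events are handled strictly in the order they were posted, so the handler code does not need to guard against concurrent calls from the loop itself.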

5 DAGSchedulerEventProcessLoop

5.1 DAGSchedulerEventProcessLoop.doOnReceive:JobSubmitted

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
   
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  }

This calls back into DAGScheduler, in its handleJobSubmitted method.
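
The dispatch step is an ordinary pattern match that forwards each event type to a handler method on the scheduler. A small sketch with hypothetical, simplified event and scheduler types (the real events carry many more fields) might look like this:

```scala
import scala.collection.mutable.ArrayBuffer

object DispatchSketch {
  // Hypothetical, simplified event types for illustration only.
  sealed trait MiniEvent
  final case class JobSubmitted(jobId: Int, partitions: Array[Int]) extends MiniEvent
  final case class StageCancelled(stageId: Int) extends MiniEvent

  // Hypothetical scheduler that just records which handler was called.
  final class MiniScheduler {
    val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

    def handleJobSubmitted(jobId: Int, partitions: Array[Int]): Unit = {
      log += s"job $jobId with ${partitions.length} partitions"
      ()
    }

    def handleStageCancellation(stageId: Int): Unit = {
      log += s"cancel stage $stageId"
      ()
    }
  }

  // The event loop only routes; the real work stays in the scheduler.
  def doOnReceive(scheduler: MiniScheduler, event: MiniEvent): Unit = event match {
    case JobSubmitted(jobId, partitions) => scheduler.handleJobSubmitted(jobId, partitions)
    case StageCancelled(stageId)         => scheduler.handleStageCancellation(stageId)
  }
}
```

Declaring the event trait as sealed lets the compiler warn when a new event type is added without a matching case.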

4.3 DAGScheduler.handleJobSubmitted

   private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
   
    var finalStage: ResultStage = null
    try {
   
      ....
      // Based on finalRDD, create the ResultSta
