Overview of how Spark triggers a job:
We can trace the job-trigger flow by walking through the wordcount example. The wordcount code is as follows:
val linesRDD = sc.textFile("hdfs://")
val wordsRDD = linesRDD.flatMap(line => line.split(" "))
val pairsRDD = wordsRDD.map(word => (word, 1))
val countRDD = pairsRDD.reduceByKey(_ + _)
countRDD.foreach(count => println(count._1 + ":" + count._2))
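The first four lines are transformations and only build up the RDD lineage lazily; nothing touches the cluster until the foreach action on the last line, which is where the job is actually submitted through sc.runJob (step ⑤ below). The lineage can be inspected before any job runs (a minimal sketch; toDebugString is the same call that spark.logLineage uses in the runJob source further down):
println(countRDD.toDebugString) // prints the recursive dependency chain without triggering a job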
① The sc.textFile method
SparkContext.scala
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
/**
 * The hadoopFile call creates a HadoopRDD whose elements are (key, value) pairs.
 * map is then applied to drop the key and keep only the value, which yields a
 * MapPartitionsRDD whose elements are the individual lines of text.
 */
def textFile(
    path: String, // the HDFS path passed in
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  // Hadoop MapReduce's TextInputFormat is the way the text is read: LongWritable is the
  // byte offset of each line, Text is the line itself.
  // In the map step, pair is that (offset, text) tuple, and pair._2.toString extracts the text.
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  FileSystem.getLocal(hadoopConfiguration)
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration)) // wrap the Hadoop configuration and broadcast it
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path) // set the input paths on the JobConf
  // The HadoopRDD created below reads the configuration through the broadcast variable above,
  // so each worker can access it locally.
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
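To make the key-stripping step concrete, the same lines RDD could be built by calling hadoopFile directly with the classes that textFile uses and dropping the offset by hand (a minimal sketch; the hdfs:// path is a placeholder just as in the wordcount code):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val kvRDD = sc.hadoopFile("hdfs://", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])           // RDD[(LongWritable, Text)]
val lines = kvRDD.map(pair => pair._2.toString)   // RDD[String], one element per line of text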
The .map() chained onto hadoopFile() ultimately calls the map method in RDD.scala.
RDD.scala
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
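The closure (context, pid, iter) => iter.map(cleanF) is invoked once per partition and transforms that partition's iterator lazily. The same idea on a plain Scala iterator (a sketch with made-up (offset, text) pairs standing in for the HadoopRDD elements):
val partitionIter = Iterator((0L, "hello world"), (12L, "hello spark"))
val lineIter = partitionIter.map(pair => pair._2) // still lazy: nothing is computed until the iterator is consumed
lineIter.toList                                   // List("hello world", "hello spark")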
② The linesRDD.flatMap method
RDD.scala
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
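flatMap likewise just wraps the cleaned function in a new MapPartitionsRDD, applying iter.flatMap per partition. On a plain Scala iterator this is the word-splitting step of wordcount (a sketch with made-up lines):
val lineIter = Iterator("hello world", "hello spark")
val wordIter = lineIter.flatMap(line => line.split(" ")) // lazy as well
wordIter.toList                                          // List("hello", "world", "hello", "spark")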
③ The wordsRDD.map method
RDD.scala
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
④ The pairsRDD.reduceByKey method
RDD itself does not define a reduceByKey method; an implicit conversion (roughly the counterpart of a wrapper class in Java) is at work here. When the program is compiled, the compiler finds an implicit conversion defined in RDD.scala:
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
new PairRDDFunctions(rdd)
}
The conversion creates a PairRDDFunctions instance: calling reduceByKey on a MapPartitionsRDD triggers Scala's implicit-conversion lookup in the enclosing scope, the MapPartitionsRDD is wrapped as a PairRDDFunctions, and its reduceByKey method is then invoked.
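What the compiler does implicitly can also be written out by hand, which makes the wrapping explicit (a minimal sketch; in normal user code the implicit conversion above does this for you, and countRDD2 is just an illustrative name):
import org.apache.spark.rdd.PairRDDFunctions

val pairFuncs = new PairRDDFunctions(pairsRDD)  // the wrapping that rddToPairRDDFunctions performs
val countRDD2 = pairFuncs.reduceByKey(_ + _)    // equivalent to pairsRDD.reduceByKey(_ + _)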
PairRDDFunctions.scala
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
reduceByKey(new HashPartitioner(numPartitions), func)
}
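The overload shown above takes an explicit partition count and hash-partitions the output; wordcount uses the single-argument form reduceByKey(_ + _), which falls back to a default partitioner. A short usage sketch (hypothetical data):
val pairs = sc.parallelize(Seq(("hello", 1), ("spark", 1), ("hello", 1)))
pairs.reduceByKey(_ + _)     // default partitioner
pairs.reduceByKey(_ + _, 4)  // explicit numPartitions: output hash-partitioned into 4 partitions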
⑤ The countRDD.foreach method
RDD.scala
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
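Unlike the transformations above, foreach does not build a new RDD: it hands the cleaned closure straight to sc.runJob, so this is the point where the job is actually submitted, and the println runs inside the executors rather than on the driver. To print on the driver one would collect the results first (a usage sketch):
countRDD.collect().foreach { case (word, count) => println(word + ":" + count) }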
SparkContext.scala
Several overloads of runJob are called here, one nested inside the next:
/**
* Run a job on a given set of partitions of an RDD, but take a function of type
* `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: Iterator[T] => U,
partitions: Seq[Int]): Array[U] = {
val cleanedFunc = clean(func)
runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int]): Array[U] = {
val results = new Array[U](partitions.size)
runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
results
}
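This middle overload shows how the final result array is assembled: it allocates one slot per partition and passes a resultHandler that writes each partition's result into the slot for that partition index. The handler pattern in isolation (a minimal sketch with hypothetical values):
val partitions = Seq(0, 1, 2)
val results = new Array[Int](partitions.size)
val resultHandler: (Int, Int) => Unit = (index, res) => results(index) = res
resultHandler(0, 5) // partition 0 produced 5
resultHandler(2, 7) // partition 2 produced 7
// results is now Array(5, 0, 7)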
/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark.
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
// call the runJob method of the DAGScheduler that was created when the SparkContext was initialized
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}