The Spark job submission flow, and how a job gets triggered to run on the cluster

javartisan

Article link: https://blog.csdn.net/Dax1n/article/details/69357716

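A job is first submitted with the spark-submit script. In practice this is a shell script that invokes the java command to run the main method of the SparkSubmit class, so the next step is to see what SparkSubmit's main method does.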

  /**
    * Entry point for submitting a job.
    * @param args command-line arguments passed on by spark-submit
    */
  def main(args: Array[String]): Unit = {

    val appArgs = new SparkSubmitArguments(args) // wrap the raw arguments
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    appArgs.action match { // pattern match on the action the user requested
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

This is nothing more than a pattern match on the command the user issued. When we submit a job, the submit method is called, so let's look at submit next:

  @tailrec
  private def submit(args: SparkSubmitArguments): Unit = {
    // non-essential code omitted (including the definition of the local function doRunMain)
    if (args.isStandaloneCluster && args.useRest) {
      try {
        // scalastyle:off println
        printStream.println("Running Spark using the REST application submission protocol.")
        // scalastyle:on println
        doRunMain()
      } catch {
        // Fail over to use the legacy submission gateway
        case e: SubmitRestConnectionException =>
          printWarning(s"Master endpoint ${args.master} was not a REST server. " +
            "Falling back to legacy submission gateway instead.")
          args.useRest = false
          submit(args)
      }
    // In all other modes, just run the main class as prepared
    } else {
      doRunMain()
    }
  }
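
The fallback in submit above (try the REST gateway first, then clear useRest and call submit again) can be modeled in isolation. This is only a minimal sketch, not Spark's code: `Args`, `RestUnavailable`, `attempts` and this `doRunMain` are hypothetical stand-ins, and the `isStandaloneCluster` check is left out for brevity.

```scala
import scala.annotation.tailrec
import scala.collection.mutable.ArrayBuffer

object SubmitFallbackSketch {
  // Hypothetical stand-in for SubmitRestConnectionException.
  final class RestUnavailable(msg: String) extends Exception(msg)

  // Hypothetical, minimal stand-in for SparkSubmitArguments.
  final class Args(val master: String, var useRest: Boolean)

  // Records which gateway was tried, so the control flow is observable.
  val attempts: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Assumption for the sketch: the master never speaks REST,
  // so the REST attempt always fails and the legacy attempt succeeds.
  private def doRunMain(args: Args): Unit = {
    attempts += (if (args.useRest) "rest" else "legacy")
    if (args.useRest) throw new RestUnavailable(s"${args.master} is not a REST server")
  }

  @tailrec
  def submit(args: Args): Unit = {
    if (args.useRest) {
      try {
        doRunMain(args)
      } catch {
        // Fail over to the legacy gateway: flip the flag and retry once.
        case _: RestUnavailable =>
          args.useRest = false
          submit(args)
      }
    } else {
      doRunMain(args)
    }
  }
}
```

Because the flag is cleared before the recursive call, the second pass always takes the else branch, so the recursion ends after at most one retry.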

submit calls doRunMain(), a local function defined inside submit (omitted above) that in turn calls runMain():

  private def runMain(
      childArgs: Seq[String],
      childClasspath: Seq[String],
      sysProps: Map[String, String],
      childMainClass: String,
      verbose: Boolean): Unit = {

    var mainClass: Class[_] = null

    // non-essential code omitted
    mainClass = Utils.classForName(childMainClass)
    // non-essential code omitted

    if (classOf[scala.App].isAssignableFrom(mainClass)) {
      printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
    }
    // non-essential code omitted
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }
    // non-essential code omitted

    try {
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      case t: Throwable =>
        findCause(t) match {
          case SparkUserAppException(exitCode) =>
            System.exit(exitCode)

          case t: Throwable =>
            throw t
        }
    }
  }

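runMain() uses reflection to invoke the main method in the jar of the job we wrote ourselves. What follows is the creation of the SparkContext: its primary constructor performs a large amount of initialization, the most important parts being the creation of the TaskScheduler and the DAGScheduler and the startup of the TaskScheduler.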
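
The reflective call can be tried outside Spark. The following is a minimal sketch rather than Spark's implementation: `HelloApp`, `lastArgs` and `ReflectiveLauncher` are hypothetical names, and it assumes `HelloApp` is compiled as a top-level object, so that the Scala compiler emits a static `main` forwarder on the class `HelloApp`.

```scala
import java.lang.reflect.Modifier

// Hypothetical user application. It must be a top-level object so that
// a static `main` forwarder exists on the class named "HelloApp".
object HelloApp {
  @volatile var lastArgs: Seq[String] = Nil

  def main(args: Array[String]): Unit = {
    lastArgs = args.toSeq
    println(s"HelloApp started with: ${args.mkString(", ")}")
  }
}

object ReflectiveLauncher {
  // Mirrors the steps of runMain: load the class, look up main, check static, invoke.
  def launch(mainClassName: String, args: Seq[String]): Unit = {
    val mainClass = Class.forName(mainClassName)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }
    // A static method has no receiver, hence null; the String array is the single argument.
    mainMethod.invoke(null, args.toArray)
  }
}
```

Calling `ReflectiveLauncher.launch("HelloApp", Seq("input.txt"))` runs `HelloApp.main` in the same JVM, which is why the user code that builds the SparkContext executes inside the process started by spark-submit.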

The rest of the walkthrough uses a program that counts the lines of a file. The job consists of a single count operation.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    // The master URL and application name are expected to come from spark-submit.
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))

    // count() is the only action here, so it is the only call that runs a job.
    println(rdd.count())

    sc.stop()
  }
}

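The creation of the SparkContext will get its own post; the execution of its primary constructor is not covered in detail here.

In the line-count program, once sc has been created we use textFile to create an RDD. Let's look at the source of textFile: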

  /**
    * Read a text file from HDFS, a local file system (available on all nodes), or any
    * Hadoop-supported file system URI, and return it as an RDD of Strings.
    */
  def textFile(
                path: String,
                minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }
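
The `map(pair => pair._2.toString)` step above exists because TextInputFormat yields key/value records, where the key is the byte offset of a line and the value is the line's text; textFile keeps only the value. A minimal sketch with plain Scala collections, where `records` is hypothetical sample data for a file containing the three lines foo, bar and baz:

```scala
object TextFilePairsSketch {
  // Hypothetical (byte offset, line text) records for a file "foo\nbar\nbaz\n".
  val records: Seq[(Long, String)] = Seq(0L -> "foo", 4L -> "bar", 8L -> "baz")

  // Same shape as the map in textFile: drop the offset, keep the line.
  def lines(recs: Seq[(Long, String)]): Seq[String] = recs.map(pair => pair._2)
}
```

Here `TextFilePairsSketch.lines(TextFilePairsSketch.records)` gives the three lines without their offsets, which is the element type of the RDD[String] that textFile returns.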

textFile creates its RDD through hadoopFile, so let's look at the source of hadoopFile:

  def hadoopFile[K, V](
                        path: String,
                        inputFormatClass: Class[_ <: InputFormat[K, V]],
                        keyClass: Class[K],
                        valueClass: Class[V],
                        minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {

    FileSystem.getLocal(hadoopConfiguration)
    // ... non-essential code omitted
    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration)) // broadcast the Hadoop configuration
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    // The `this` argument (the current SparkContext) is the key point here.
    new HadoopRDD(this, confBroadcast, Some(setInputPathsFunc), inputFormatClass,
      keyClass, valueClass, minPartitions).setName(path)
  }

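As we can see, when the RDD is created the current SparkContext is passed in as this. In fact, every method in Spark that creates an RDD needs a reference to the current SparkContext (the reason is explained below).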

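The RDD now exists, and the next step is to count its lines, so we call the count operator. count is an action, and as we all know only an action triggers a job on the cluster; a transformation does not. So how is the job actually triggered? Who calls runJob to start it? So far the only thing we have seen started inside SparkContext is the TaskScheduler, with no other call or startup in sight. The secret lies in the fact that the RDD holds a reference to the SparkContext.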

Let's see what org.apache.spark.rdd.RDD#count does:

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Now it is clear: inside count, runJob is called through the SparkContext reference held by the RDD, and that call is what triggers the job on the cluster. That resolves the puzzle.
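
The relationship can be reduced to a small model. This is a sketch under simplifying assumptions, not Spark's API: `MiniContext`, `MiniRDD` and `jobsRun` are hypothetical, partitions are plain in-memory sequences, and the "job" runs locally instead of going through a scheduler.

```scala
import scala.reflect.ClassTag

object CountSketch {
  // Hypothetical stand-in for SparkContext: the only place where jobs are run.
  final class MiniContext {
    var jobsRun = 0

    // Apply func to every partition and collect one result per partition.
    def runJob[T, U: ClassTag](rdd: MiniRDD[T], func: Iterator[T] => U): Array[U] = {
      jobsRun += 1
      rdd.partitions.map(p => func(p.iterator)).toArray
    }
  }

  // Hypothetical stand-in for an RDD: partitioned data plus a context reference.
  final class MiniRDD[T](val sc: MiniContext, val partitions: Seq[Seq[T]]) {
    // Same shape as RDD#count: per-partition sizes from runJob, then summed.
    def count(): Long = sc.runJob(this, (it: Iterator[T]) => it.size.toLong).sum
  }
}
```

Creating a `MiniRDD` leaves `jobsRun` at 0; only calling `count()` increments it, because count is the method that reaches back to the context.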

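That raises another question, though: why can't the map operator trigger a job on the cluster? The best way to find out is to read the source. With any open source project, once you understand the general principles and the usage, reading and tracing the source is the best way to resolve the problems you run into. So let's look at the source of the map operator.

org.apache.spark.rdd.RDD#map: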

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

Now it is clear: the map operator never calls runJob through the SparkContext reference. It only returns a MapPartitionsRDD that represents the result.
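
The laziness is easy to observe in a toy model. This is a minimal sketch and not Spark's classes: `MiniRDD`, `compute` and `fromSeq` are hypothetical, and there is a single partition and no scheduler.

```scala
object LazyMapSketch {
  // Hypothetical stand-in for an RDD with a single partition.
  abstract class MiniRDD[T] {
    // Produces the data; only called when an action runs.
    def compute(): Iterator[T]

    // Transformation: wraps the parent in a new node, touches no data.
    def map[U](f: T => U): MiniRDD[U] = {
      val parent = this
      new MiniRDD[U] {
        def compute(): Iterator[U] = parent.compute().map(f)
      }
    }

    // Action: walks the whole iterator, which forces every mapped element.
    def count(): Long = compute().foldLeft(0L)((n, _) => n + 1)
  }

  def fromSeq[T](data: Seq[T]): MiniRDD[T] =
    new MiniRDD[T] { def compute(): Iterator[T] = data.iterator }
}
```

A usage example: if the function passed to map increments a counter, the counter stays at 0 after `map` returns and only changes once `count()` is called, which matches the split between transformations and actions described above.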