Spark Source Code Analysis: The submit Submission Process (Part 1)

  • 1. Suppose we submit the following command to the cluster (a programmatic equivalent using the launcher API is sketched right after the command)
bin/spark-submit \
--class com.wt.spark.WordCount \
--master yarn \
WordCount.jar \
/input \
/output
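
For reference, the same submission can also be triggered from JVM code through Spark's launcher API (org.apache.spark.launcher.SparkLauncher). The sketch below simply reuses the jar, class, master and arguments from the command above; it needs the spark-launcher module on the classpath.

    import org.apache.spark.launcher.SparkLauncher

    object SubmitWordCount {
      def main(args: Array[String]): Unit = {
        // Same submission as the shell command above, expressed through the launcher API.
        val process = new SparkLauncher()
          .setAppResource("WordCount.jar")          // application jar
          .setMainClass("com.wt.spark.WordCount")   // --class
          .setMaster("yarn")                        // --master
          .addAppArgs("/input", "/output")          // application arguments
          .setVerbose(true)
          .launch()                                 // forks a spark-submit process under the hood
        process.waitFor()
      }
    }

Under the hood this still runs the same bin/spark-submit script, so everything analysed below applies to it as well.
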
  • 2. The launch goes through spark-submit, so let's go straight to the spark-submit script

    # -z tests whether the following variable is empty (true if empty); variables are expanded inside double quotes but not inside single quotes

    # This check determines whether SPARK_HOME is set; if it is empty, the then-branch runs

    # source: `source filename` reads and executes the commands in filename within the current shell
    # $0 is the file name of the script itself, which here is spark-submit
    # dirname strips the last path component, so `dirname $0` yields the directory containing this script
    # $(command) substitutes the output of the command

    # So the whole if statement means: if SPARK_HOME is not set, run the find-spark-home script in the same directory to set it
    if [ -z "${SPARK_HOME}" ]; then
      source "$(dirname "$0")"/find-spark-home
    fi
    
    # disable randomized hash for string in Python 3.3+
    export PYTHONHASHSEED=0
    # Run the spark-class script, passing org.apache.spark.deploy.SparkSubmit and "$@"
    # $@ here expands to all of the arguments originally given to spark-submit
    exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
    
    • At the very end, the spark-submit script executes the following line

      exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
      
      • So the overall logic of the spark-submit script is:
        • first, check whether SPARK_HOME is set;
        • if it is not set, source find-spark-home to set it;
        • then, either way, exec spark-class with org.apache.spark.deploy.SparkSubmit and all of the original arguments
  • The find-spark-home script

    # Path to find_spark_home.py, used below to check whether that Python helper (which can resolve SPARK_HOME for pip installs) exists
    FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"
    
    # Short circuit if the user already has this set.
    ## If SPARK_HOME is already set (non-empty), exit successfully
    if [ ! -z "${SPARK_HOME}" ]; then
       exit 0
    # -f tests for an existing regular file; with ! this branch runs when find_spark_home.py does NOT exist, and SPARK_HOME is derived from the script's location below
    elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
      # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
      # need to search the different Python directories for a Spark installation.
      # Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
      # spark-submit in another directory we want to use that version of PySpark rather than the
      # pip installed version of PySpark.
      export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
    else
      # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
      # Default to standard python interpreter unless told otherwise
      if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
         PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
      fi
      export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
    fi
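
    The net effect of the script is: an existing, non-empty SPARK_HOME always wins; otherwise (for a regular, non-pip install) SPARK_HOME becomes the parent of the directory that contains the script. Below is a minimal JVM-side sketch of that rule, ignoring the pip-installed branch that goes through find_spark_home.py; the script path used here is a placeholder:

        import java.io.File

        object ResolveSparkHome {
          // Sketch of the find-spark-home rule: keep an existing SPARK_HOME if set,
          // otherwise treat the grandparent of the launcher script (i.e. the parent
          // of its bin/ directory) as the Spark installation root.
          def resolve(scriptPath: String = "/opt/spark/bin/spark-submit"): String =
            sys.env.get("SPARK_HOME").filter(_.nonEmpty).getOrElse {
              new File(scriptPath).getAbsoluteFile.getParentFile.getParentFile.getPath
            }

          def main(args: Array[String]): Unit =
            println(s"SPARK_HOME resolved to: ${resolve()}")
        }
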
    
    • As you can see, even if the user never set SPARK_HOME, the script resolves a value and exports it as an environment variable for the rest of the launch
    • Once SPARK_HOME is settled, the spark-class script runs; that is the part we really care about, and its source is as follows:
  • The spark-class script

    #!/usr/bin/env bash
    # Once again, make sure SPARK_HOME is set (same check as in spark-submit)
    if [ -z "${SPARK_HOME}" ]; then
      source "$(dirname "$0")"/find-spark-home
    fi
    # Run the load-spark-env.sh script, whose main job is to set up some variables:
    # it exports the settings from conf/spark-env.sh into the environment for later use
    # and determines the Scala version (SPARK_SCALA_VERSION)
    . "${SPARK_HOME}"/bin/load-spark-env.sh
    
    # Find the java binary
    # Determine which java binary to run
    # -n tests whether a variable is non-empty (true when its length is not zero)
    # if JAVA_HOME is set, use ${JAVA_HOME}/bin/java; otherwise fall back to whatever `command -v java` finds on the PATH
    if [ -n "${JAVA_HOME}" ]; then
      RUNNER="${JAVA_HOME}/bin/java"
    else
      if [ "$(command -v java)" ]; then
        RUNNER="java"
      else
        echo "JAVA_HOME is not set" >&2
        exit 1
      fi
    fi
    
    # Find Spark jars.
    # -d tests whether the path exists and is a directory
    # locate the directory holding Spark's jars for the launch classpath
    if [ -d "${SPARK_HOME}/jars" ]; then
      SPARK_JARS_DIR="${SPARK_HOME}/jars"
    else
      SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
    fi
    
    if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
      echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
      echo "You need to build Spark with the target \"package\" before running this program." 1>&2
      exit 1
    else
      LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
    fi
    
    # Add the launcher build dir to the classpath if requested.
    if [ -n "$SPARK_PREPEND_CLASSES" ]; then
      LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
    fi
    
    # For tests
    if [[ -n "$SPARK_TESTING" ]]; then
      unset YARN_CONF_DIR
      unset HADOOP_CONF_DIR
    fi
    
    # The launcher library will print arguments separated by a NULL character, to allow arguments with
    # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
    # an array that will be used to exec the final command.
    #
    # The exit code of the launcher is appended to the output, so the parent shell removes it from the
    # command array and checks the value to see if the launcher succeeded.
    # Run the class org.apache.spark.launcher.Main, which prints the fully resolved launch command
    build_command() {
      "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
      printf "%d\0" $?
    }
    
    # Turn off posix mode since it does not allow process substitution
    # Collect the arguments emitted by build_command into the CMD array
    set +o posix
    CMD=()
    while IFS= read -d '' -r ARG; do
      CMD+=("$ARG")
    done < <(build_command "$@")
    
    COUNT=${#CMD[@]}
    LAST=$((COUNT - 1))
    LAUNCHER_EXIT_CODE=${CMD[$LAST]}
    
    # Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
    # the code that parses the output of the launcher to get confused. In those cases, check if the
    # exit code is an integer, and if it's not, handle it as a special error case.
    if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
      echo "${CMD[@]}" | head -n-1 1>&2
      exit 1
    fi
    
    if [ $LAUNCHER_EXIT_CODE != 0 ]; then
      exit $LAUNCHER_EXIT_CODE
    fi
    
    CMD=("${CMD[@]:0:$LAST}")
    # exec the assembled command; for spark-submit this launches org.apache.spark.deploy.SparkSubmit
    exec "${CMD[@]}"
    
    • The spark-class script is a bit more involved; overall it works like this (a sketch of the launcher's output contract follows this list):
      • check/resolve SPARK_HOME
      • source load-spark-env.sh to set the required environment variables (the Scala version, plus whatever conf/spark-env.sh defines)
      • determine the java binary to run
      • locate the Spark jars and build the launch classpath
      • run org.apache.spark.launcher.Main, which prints the fully resolved launch command, and read its output into CMD
      • check the launcher's exit code, i.e. whether the user-supplied options were parsed successfully
      • if so, exec the assembled command, which starts org.apache.spark.deploy.SparkSubmit
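
    The hand-off between spark-class and org.apache.spark.launcher.Main deserves a note: the launcher prints every argument of the final java command terminated by a NUL byte, build_command then appends its exit code, and the while/read -d '' loop splits the stream back into the CMD array. Here is a toy Scala producer that honours that output format (not the real launcher; the JVM path and classpath are placeholders):

        object FakeLauncher {
          // Mimics the output contract spark-class expects from org.apache.spark.launcher.Main:
          // each argument of the command to exec is terminated by a NUL byte, so arguments
          // containing spaces or shell metacharacters survive the round trip.
          def main(args: Array[String]): Unit = {
            val cmd = Seq(
              "/usr/lib/jvm/java/bin/java",   // placeholder JVM path
              "-cp", "/opt/spark/jars/*",     // placeholder classpath
              "org.apache.spark.deploy.SparkSubmit"
            ) ++ args                          // the original spark-submit arguments
            cmd.foreach { arg => print(arg); print(0.toChar) }  // NUL-terminate each argument
            // spark-class's build_command then appends this process's exit code
            // (printf "%d\0" $?), which the shell strips off and checks before exec-ing CMD.
          }
        }
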
  • 3. Executing the org.apache.spark.deploy.SparkSubmit class

      def main(args: Array[String]): Unit = {
          //take the arguments passed in from the submit script
        val appArgs = new SparkSubmitArguments(args)
        if (appArgs.verbose) {
          // scalastyle:off println
          printStream.println(appArgs)
          // scalastyle:on println
        }
          //dispatch to the matching action based on the parsed arguments
        appArgs.action match {
            //submit the application
          case SparkSubmitAction.SUBMIT => submit(appArgs)
            //kill only applies to standalone and Mesos cluster mode
          case SparkSubmitAction.KILL => kill(appArgs)
            //status requests likewise only apply to standalone and Mesos cluster mode
          case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
        }
      }
    
    • 1. The call val appArgs = new SparkSubmitArguments(args)

      Building the arguments object involves three main steps: 1. parse the command-line arguments, 2. load configuration values and environment variables into memory, 3. validate the arguments. (A small sketch of the property-merge rule follows the snippet below.)

      // Set parameters from command line arguments
        try {
          parse(args.asJava)
        } catch {
          case e: IllegalArgumentException =>
            SparkSubmit.printErrorAndExit(e.getMessage())
        }
        // Load the default properties, i.e. the values from spark-defaults.conf
        mergeDefaultSparkProperties()
        // Remove keys that don't start with "spark." from `sparkProperties`.
        // Drop properties that do not start with "spark." (they are not valid Spark properties)
        ignoreNonSparkProperties()
        // Use `sparkProperties` map along with env vars to fill in any missing parameters
        // Fill in any still-missing settings (master, cores, memory, mainClass, etc.) from the properties map and environment variables
        loadEnvironmentArguments()
        // Validate that the resulting arguments are consistent
        validateArguments()
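
      The net effect of mergeDefaultSparkProperties() plus ignoreNonSparkProperties() is: command-line properties win, spark-defaults.conf only fills in keys that are still missing, and anything that does not start with "spark." is dropped. A minimal sketch of that merge rule (standalone code, not the SparkSubmitArguments implementation itself):

        object MergeSparkProps {
          // Command-line --conf values take precedence; defaults only fill gaps;
          // keys without the "spark." prefix are ignored entirely.
          def merge(cmdLine: Map[String, String], defaults: Map[String, String]): Map[String, String] =
            (defaults ++ cmdLine).filter { case (k, _) => k.startsWith("spark.") }

          def main(args: Array[String]): Unit = {
            val cmdLine  = Map("spark.executor.memory" -> "4g")
            val defaults = Map(
              "spark.executor.memory"  -> "1g",                     // overridden by the command line
              "spark.eventLog.enabled" -> "true",                   // kept as a default
              "hive.metastore.uris"    -> "thrift://example:9083")  // dropped: not a spark.* key
            println(merge(cmdLine, defaults))
            // Map(spark.executor.memory -> 4g, spark.eventLog.enabled -> true)
          }
        }
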
      
    • 2. Executing submit(appArgs)

      private def submit(args: SparkSubmitArguments): Unit = {
          //prepare the submission environment: child arguments, classpath, system properties, and the main class to run
          val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
      
          def doRunMain(): Unit = {
            if (args.proxyUser != null) {
              val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
                UserGroupInformation.getCurrentUser())
              try {
                proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
                  override def run(): Unit = {
                    runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
                  }
                })
              } catch {
                case e: Exception =>
                  // Hadoop's AuthorizationException suppresses the exception's stack trace, which
                  // makes the message printed to the output by the JVM not very helpful. Instead,
                  // detect exceptions with empty stack traces here, and treat them differently.
      
                  if (e.getStackTrace().length == 0) {
                    // scalastyle:off println
                  //In standalone cluster mode there are two submission gateways:
                  // (1) the traditional RPC gateway, which uses o.a.s.deploy.Client
                  // (2) since Spark 1.3, the new REST-based gateway (the default); if the REST
                  //     submission fails, spark-submit falls back to the traditional gateway

                    printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
                    // scalastyle:on println
                    exitFn(1)
                  } else {
                    throw e
                  }
              }
            } else {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          }
      
      • In standalone cluster mode there are two submission gateways, selected by args.useRest:

        1. the traditional RPC gateway, which uses o.a.s.deploy.Client (args.useRest is false)
        2. since Spark 1.3, the new REST-based gateway (args.useRest is true, the default)
          Spark 1.3+ submits through the REST-based gateway by default, but spark-submit fails over
          to the traditional gateway if the REST submission cannot be completed (see the sketch below)
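
        That fail-over is simply a try/catch around the REST submission path inside submit(): try the REST gateway first and, if the connection fails, go through the legacy RPC client instead. A hedged sketch of that control flow (the helper names are illustrative; the exception mirrors Spark's SubmitRestConnectionException but is defined locally):

          object GatewayFallbackSketch {
            // Local stand-in for org.apache.spark.deploy.rest.SubmitRestConnectionException.
            final case class SubmitRestConnectionException(msg: String) extends Exception(msg)

            def submitViaRestGateway(): Unit =
              throw SubmitRestConnectionException("REST server not reachable") // simulated failure

            def submitViaLegacyRpcGateway(): Unit =
              println("falling back to the legacy o.a.s.deploy.Client gateway")

            // In SparkSubmit both branches end up calling doRunMain() with a
            // differently prepared environment; here we just print the choice.
            def submit(standaloneCluster: Boolean, useRest: Boolean): Unit =
              if (standaloneCluster && useRest) {
                try submitViaRestGateway()
                catch {
                  case _: SubmitRestConnectionException =>
                    // The Spark 1.3+ default path failed; retry via the traditional gateway.
                    submitViaLegacyRpcGateway()
                }
              } else {
                submitViaLegacyRpcGateway()
              }

            def main(args: Array[String]): Unit = submit(standaloneCluster = true, useRest = true)
          }
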
  • 4. Executing the runMain() method

    private def runMain(
          childArgs: Seq[String],
          childClasspath: Seq[String],
          sysProps: Map[String, String],
          childMainClass: String,
          verbose: Boolean): Unit = {
        // scalastyle:off println
        if (verbose) {
          printStream.println(s"Main class:\n$childMainClass")
          printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
          printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
          printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
          printStream.println("\n")
        }
        // scalastyle:on println
    //1. Choose a class loader (child-first if spark.driver.userClassPathFirst is set) and install it as the context class loader
        val loader =
          if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
            new ChildFirstURLClassLoader(new Array[URL](0),
              Thread.currentThread.getContextClassLoader)
          } else {
            new MutableURLClassLoader(new Array[URL](0),
              Thread.currentThread.getContextClassLoader)
          }
        Thread.currentThread.setContextClassLoader(loader)
    //Add the user's jars to the class loader so their classes can be resolved
        for (jar <- childClasspath) {
          addJarToClasspath(jar, loader)
        }
    //Propagate the Spark configuration (sysProps) as JVM system properties
        for ((key, value) <- sysProps) {
          System.setProperty(key, value)
        }
    
        var mainClass: Class[_] = null
    
        try {
        //Load the class named by childMainClass, i.e. obtain its Class object via reflection
          mainClass = Utils.classForName(childMainClass)
        } catch {
          case e: ClassNotFoundException =>
            e.printStackTrace(printStream)
            if (childMainClass.contains("thriftserver")) {
              // scalastyle:off println
              printStream.println(s"Failed to load main class $childMainClass.")
              printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
              // scalastyle:on println
            }
            System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
          case e: NoClassDefFoundError =>
            e.printStackTrace(printStream)
            if (e.getMessage.contains("org/apache/hadoop/hive")) {
              // scalastyle:off println
              printStream.println(s"Failed to load hive class.")
              printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
              // scalastyle:on println
            }
            System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
        }
    
        // SPARK-4170
        if (classOf[scala.App].isAssignableFrom(mainClass)) {
          printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
        }
    //Look up the main method on the reflectively loaded class
        val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
        if (!Modifier.isStatic(mainMethod.getModifiers)) {
          throw new IllegalStateException("The main method in the given main class must be static")
        }
    
        @tailrec
        def findCause(t: Throwable): Throwable = t match {
          case e: UndeclaredThrowableException =>
            if (e.getCause() != null) findCause(e.getCause()) else e
          case e: InvocationTargetException =>
            if (e.getCause() != null) findCause(e.getCause()) else e
          case e: Throwable =>
            e
        }
    
        try {
      //Invoke the main method of the resolved class
          mainMethod.invoke(null, childArgs.toArray)
        } catch {
          case t: Throwable =>
            findCause(t) match {
              case SparkUserAppException(exitCode) =>
                System.exit(exitCode)
    
              case t: Throwable =>
                throw t
            }
        }
      }
    
    • So the question is: what exactly is childMainClass?

      • childMainClass first appears in the SparkSubmit class, as one element of a method's return value

        private def submit(args: SparkSubmitArguments): Unit = {
            
            val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
            
            ...
        }
        
        • Let's step into the prepareSubmitEnvironment(args) method

          private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
                : (Seq[String], Seq[String], Map[String, String], String) = {
              // Return values
              val childArgs = new ArrayBuffer[String]()
              val childClasspath = new ArrayBuffer[String]()
              val sysProps = new HashMap[String, String]()
              var childMainClass = ""
                    ...
              if (deployMode == CLIENT || isYarnCluster) {
                childMainClass = args.mainClass
                if (isUserJar(args.primaryResource)) {
                  childClasspath += args.primaryResource
                }
                if (args.jars != null) { childClasspath ++= args.jars.split(",") }
              }
                    ...
              if (isYarnCluster) {
                childMainClass = "org.apache.spark.deploy.yarn.Client"
                if (args.isPython) {
                  childArgs += ("--primary-py-file", args.primaryResource)
                  childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
                } else if (args.isR) {
                  val mainFile = new Path(args.primaryResource).getName
                  childArgs += ("--primary-r-file", mainFile)
                  childArgs += ("--class", "org.apache.spark.deploy.RRunner")
                } else {
                  if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
                    childArgs += ("--jar", args.primaryResource)
                  }
                  childArgs += ("--class", args.mainClass)
                }
                if (args.childArgs != null) {
                  args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
                }
              }
                    ...
               (childArgs, childClasspath, sysProps, childMainClass)
          }
          
            • So now we know what childMainClass is when running on YARN:

            //cluster:  childMainClass = "org.apache.spark.deploy.yarn.Client"
            //client:	childMainClass = "com.wt.spark.WordCount"
            
    • With that, let's come back to the line above

      mainClass = Utils.classForName(childMainClass)
      

      The class loaded here via reflection is therefore org.apache.spark.deploy.yarn.Client (in yarn-cluster mode); a minimal standalone sketch of this reflective invocation follows below.
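
      What runMain ultimately does with that name is plain JVM reflection: install a class loader over the user's jars, load the class, look up its static main(Array[String]) method and invoke it. A minimal standalone sketch of that sequence (the jar path is a placeholder, and actually running it requires the Spark YARN classes on the classpath):

        import java.lang.reflect.Modifier
        import java.net.{URL, URLClassLoader}

        object ReflectiveMainSketch {
          def main(args: Array[String]): Unit = {
            // 1. Build a class loader over the user's jar(s), as addJarToClasspath does.
            val jarUrl = new URL("file:/path/to/WordCount.jar")   // placeholder path
            val loader = new URLClassLoader(Array(jarUrl), Thread.currentThread.getContextClassLoader)
            Thread.currentThread.setContextClassLoader(loader)

            // 2. Load the main class by name: org.apache.spark.deploy.yarn.Client in
            //    yarn-cluster mode, the user's own class in client mode.
            val mainClass = Class.forName("org.apache.spark.deploy.yarn.Client", true, loader)

            // 3. Find the static main(String[]) method and invoke it with the child args,
            //    which is exactly what mainMethod.invoke(null, childArgs.toArray) does.
            val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
            require(Modifier.isStatic(mainMethod.getModifiers), "the main method must be static")
            mainMethod.invoke(null, args)
          }
        }
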

  • 5. The next step is mainMethod.invoke(null, childArgs.toArray); see the next post:
    Spark Source Code Analysis: The submit Submission Process (Part 2)
