Spark Source Code Analysis: The spark-submit Submission Process
- 1. Suppose we submit the following command to the cluster:
bin/spark-submit \
--class com.wt.spark.WordCount \
--master yarn \
WordCount.jar \
/input \
/output
-
2. The launch script calls spark-submit, so let's look at the spark-submit script first.
# -z tests whether the variable that follows is empty (true if empty); variables are expanded inside double quotes, not single quotes
# This step checks whether SPARK_HOME is empty; if it is, the then-branch runs
# source filename reads and executes the commands in filename in the current bash environment
# $0 is the name of the script file itself, here spark-submit
# dirname returns the directory part of a path, so dirname "$0" is the directory containing this script
# $(command) substitutes the output of the command
# So the whole if statement means: if SPARK_HOME is not set, run find-spark-home in the script's directory to set it
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Run the spark-class script, passing org.apache.spark.deploy.SparkSubmit and "$@"
# where "$@" is every argument that spark-submit itself received
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
-
At the end of the spark-submit script, the following line is executed:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
- So the overall logic of the spark-submit script is:
- first, check whether SPARK_HOME is set;
- if it is not set, source find-spark-home to resolve and export it;
- then exec the spark-class script, passing org.apache.spark.deploy.SparkSubmit together with all of the original arguments.
-
-
The find-spark-home script
# Path of the python script that can determine SPARK_HOME; used below to check whether that script exists
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"

# Short circuit if the user already has this set.
# If SPARK_HOME is already non-empty, exit successfully
if [ ! -z "${SPARK_HOME}" ]; then
  exit 0
# -f tests whether the path exists and is a regular file; if find_spark_home.py is not there,
# derive SPARK_HOME from the parent directory of this script
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
  # need to search the different Python directories for a Spark installation.
  # Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
  # spark-submit in another directory we want to use that version of PySpark rather than the
  # pip installed version of PySpark.
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
  # Default to standard python interpreter unless told otherwise
  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
     PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
  fi
  export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
fi
- As we can see, if the user has not set SPARK_HOME beforehand, this script resolves it automatically and exports it as an environment variable for the scripts that follow.
- Once SPARK_HOME is set, the spark-class script is executed. This is the key part of our analysis; its source is as follows:
-
The spark-class script
#!/usr/bin/env bash

# As before, make sure SPARK_HOME is set
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Run load-spark-env.sh, whose main purpose is to load and export some variables:
# the settings from spark-env.sh, and the Scala version variable
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
# Determine the java executable to use
# -n tests whether the variable's length is non-zero
# If Java is installed but JAVA_HOME is not set, command -v java locates the java binary on the PATH
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
# -d tests whether the path is a directory
# Locate the directory containing Spark's jars, used to build the launch classpath
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
# Run the class org.apache.spark.launcher.Main, which prints the fully resolved launch command
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
# Collect the arguments produced by build_command into the CMD array
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
# Exec the assembled command, whose main class is org.apache.spark.deploy.SparkSubmit
exec "${CMD[@]}"
- The execution flow of spark-class is a bit more involved; overall it is:
- check (and if necessary set) SPARK_HOME;
- source load-spark-env.sh, which sets environment variables needed later, such as the Scala version, and in turn loads spark-env.sh;
- determine the java executable to use (RUNNER);
- locate the Spark jars and build the launch classpath;
- run org.apache.spark.launcher.Main, which prints the fully resolved launch command, collected into the CMD array;
- check whether the launcher exited successfully (i.e. whether the user-supplied arguments were valid);
- if so, exec the command in CMD, whose main class is org.apache.spark.deploy.SparkSubmit.
-
3. Execute the org.apache.spark.deploy.SparkSubmit class
def main(args: Array[String]): Unit = {
  // Parse the arguments passed in by the submit script
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  // Dispatch to the corresponding action based on the parsed arguments
  appArgs.action match {
    // Submit the application
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    // Only triggered in standalone and Mesos cluster modes
    case SparkSubmitAction.KILL => kill(appArgs)
    // Only triggered in standalone and Mesos cluster modes
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
-
1. Call val appArgs = new SparkSubmitArguments(args)
Building the argument object involves three main steps: 1. parse the command line, 2. load environment variables and default properties into memory, 3. validate the arguments.
// Set parameters from command line arguments
try {
  parse(args.asJava)
} catch {
  case e: IllegalArgumentException =>
    SparkSubmit.printErrorAndExit(e.getMessage())
}
// Load the default properties, i.e. the values from spark-defaults.conf
mergeDefaultSparkProperties()
// Remove keys that don't start with "spark." from `sparkProperties`.
// Ignore invalid Spark properties, i.e. keys that do not start with "spark."
ignoreNonSparkProperties()
// Use `sparkProperties` map along with env vars to fill in any missing parameters
// Fill in missing values from the environment: master, extra classes, cores, memory, mainClass, etc.
loadEnvironmentArguments()
// Validate the arguments
validateArguments()
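To make these three steps concrete, here is a rough, purely illustrative sketch of the values the parsed argument object ends up holding for the WordCount submission from the top of this post. The field names mirror those in SparkSubmitArguments, but the snippet itself is just plain vals, not the real class:

// Illustrative sketch only: roughly what SparkSubmitArguments resolves for
// "bin/spark-submit --class com.wt.spark.WordCount --master yarn WordCount.jar /input /output".
// Exact values depend on your environment and on spark-defaults.conf.
object ParsedArgsSketch {
  val master          = "yarn"                      // from --master
  val mainClass       = "com.wt.spark.WordCount"    // from --class
  val primaryResource = "WordCount.jar"             // the first positional argument (the user jar)
  val childArgs       = Seq("/input", "/output")    // remaining arguments, passed to the user's main()
  // action defaults to SUBMIT when neither --kill nor --status is given

  def main(args: Array[String]): Unit =
    println(s"master=$master mainClass=$mainClass resource=$primaryResource args=$childArgs")
}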
-
2. Execute submit(appArgs)
private def submit(args: SparkSubmitArguments): Unit = {
  // Prepare the submission environment
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception =>
          // Hadoop's AuthorizationException suppresses the exception's stack trace, which
          // makes the message printed to the output by the JVM not very helpful. Instead,
          // detect exceptions with empty stack traces here, and treat them differently.
          if (e.getStackTrace().length == 0) {
            // scalastyle:off println
            printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            // scalastyle:on println
            exitFn(1)
          } else {
            throw e
          }
      }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }
-
In standalone cluster mode there are two submission gateways, selected by args.useRest:
- the legacy RPC gateway, implemented by o.a.s.deploy.Client (args.useRest = false)
- the new REST-based gateway introduced in Spark 1.3 (args.useRest = true, the default)
Since Spark 1.3 the REST-based gateway is used by default, but if the REST submission fails,
spark-submit fails over to the legacy gateway, as sketched below.
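The part of submit() that is not quoted above implements this fail-over. Below is a minimal, self-contained sketch of the control flow; the names (e.g. SubmitRestConnectionException, submitViaRest) only imitate the real ones and are not the actual Spark code:

// Minimal, self-contained sketch of the standalone-cluster gateway fail-over pattern.
object GatewayFailoverSketch {
  final class SubmitRestConnectionException(msg: String) extends Exception(msg)

  // Stand-in for the REST-based gateway; here it always fails, as if the master
  // were an older, non-REST master.
  def submitViaRest(): Unit =
    throw new SubmitRestConnectionException("master endpoint is not a REST server")

  // Stand-in for the legacy gateway implemented by o.a.s.deploy.Client
  def submitViaLegacyRpc(): Unit =
    println("submitted through the legacy RPC gateway")

  def submit(useRest: Boolean): Unit =
    if (useRest) {
      try submitViaRest()
      catch {
        case e: SubmitRestConnectionException =>
          println(s"REST submission failed (${e.getMessage}); falling back to the legacy gateway")
          submit(useRest = false) // retry once with the legacy gateway
      }
    } else {
      submitViaLegacyRpc()
    }

  def main(args: Array[String]): Unit = submit(useRest = true)
}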
-
4. Execute the runMain() method
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  // scalastyle:off println
  if (verbose) {
    printStream.println(s"Main class:\n$childMainClass")
    printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
    printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
    printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
    printStream.println("\n")
  }
  // scalastyle:on println

  // 1. Build the class loader and install it on the current thread
  val loader =
    if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)

  // Add the jars on the child classpath to the class loader
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }

  // Apply the resource/configuration settings as system properties
  for ((key, value) <- sysProps) {
    System.setProperty(key, value)
  }

  var mainClass: Class[_] = null

  try {
    // Load the class named by childMainClass via reflection; step into classForName to see how
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: ClassNotFoundException =>
      e.printStackTrace(printStream)
      if (childMainClass.contains("thriftserver")) {
        // scalastyle:off println
        printStream.println(s"Failed to load main class $childMainClass.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        // scalastyle:on println
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    case e: NoClassDefFoundError =>
      e.printStackTrace(printStream)
      if (e.getMessage.contains("org/apache/hadoop/hive")) {
        // scalastyle:off println
        printStream.println(s"Failed to load hive class.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        // scalastyle:on println
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }

  // SPARK-4170
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }

  // From the class object obtained via reflection, look up the main method
  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }

  @tailrec
  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: InvocationTargetException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: Throwable =>
      e
  }

  try {
    // Invoke the main method of the resolved class
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}
-
So the question becomes: what is childMainClass?
-
childMainClass first appears in the SparkSubmit class, as one of the values returned by prepareSubmitEnvironment:
private def submit(args: SparkSubmitArguments): Unit = {
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
  ...
}
-
Let's step into the prepareSubmitEnvironment(args) method:
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
    : (Seq[String], Seq[String], Map[String, String], String) = {
  // Return values
  val childArgs = new ArrayBuffer[String]()
  val childClasspath = new ArrayBuffer[String]()
  val sysProps = new HashMap[String, String]()
  var childMainClass = ""
  ...
  if (deployMode == CLIENT || isYarnCluster) {
    childMainClass = args.mainClass
    if (isUserJar(args.primaryResource)) {
      childClasspath += args.primaryResource
    }
    if (args.jars != null) { childClasspath ++= args.jars.split(",") }
  }
  ...
  if (isYarnCluster) {
    childMainClass = "org.apache.spark.deploy.yarn.Client"
    if (args.isPython) {
      childArgs += ("--primary-py-file", args.primaryResource)
      childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
    } else if (args.isR) {
      val mainFile = new Path(args.primaryResource).getName
      childArgs += ("--primary-r-file", mainFile)
      childArgs += ("--class", "org.apache.spark.deploy.RRunner")
    } else {
      if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
        childArgs += ("--jar", args.primaryResource)
      }
      childArgs += ("--class", args.mainClass)
    }
    if (args.childArgs != null) {
      args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
    }
  }
  ...
  (childArgs, childClasspath, sysProps, childMainClass)
}
-
So now we know what childMainClass is in a YARN environment:
// yarn cluster mode: childMainClass = "org.apache.spark.deploy.yarn.Client"
// yarn client mode:  childMainClass = "com.wt.spark.WordCount"
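Applying the yarn-cluster branch quoted above to the WordCount submission from the beginning of this post, the prepared values would look roughly like this (illustrative only; the real values also depend on resolved paths and the Spark version):

// Roughly what prepareSubmitEnvironment returns for the example WordCount submission
// in yarn cluster mode (sketch, not actual Spark output).
object PreparedEnvSketch {
  val childMainClass = "org.apache.spark.deploy.yarn.Client"
  val childArgs = Seq(
    "--jar",   "WordCount.jar",           // args.primaryResource
    "--class", "com.wt.spark.WordCount",  // args.mainClass
    "--arg",   "/input",                  // each application argument becomes its own --arg pair
    "--arg",   "/output"
  )

  def main(args: Array[String]): Unit =
    println(s"childMainClass=$childMainClass childArgs=$childArgs")
}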
-
With that, let's go back to the line above:
mainClass = Utils.classForName(childMainClass)
The class loaded here through reflection is therefore, in yarn cluster mode, org.apache.spark.deploy.yarn.Client.
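The mechanism is plain Java reflection: load the class by name, look up its static main(String[]) method, and invoke it. Here is a minimal, self-contained sketch of that mechanism; Demo and its arguments are made up for illustration and are not part of the Spark source:

import java.lang.reflect.Modifier

// A stand-in application class; in the real flow this would be
// org.apache.spark.deploy.yarn.Client (cluster mode) or the user's main class (client mode).
object Demo {
  def main(args: Array[String]): Unit = println("Demo.main called with: " + args.mkString(", "))
}

object ReflectiveMainSketch {
  def runMain(mainClassName: String, childArgs: Array[String]): Unit = {
    // Load the class by name, as Utils.classForName does with childMainClass
    val mainClass = Class.forName(mainClassName)
    // Look up the static main(String[]) entry point
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers),
      "The main method in the given main class must be static")
    // Static method, so the receiver is null; the String[] is passed as the single argument
    mainMethod.invoke(null, childArgs)
  }

  def main(args: Array[String]): Unit = runMain("Demo", Array("/input", "/output"))
}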
-
-
5. Next, mainMethod.invoke(null, childArgs.toArray) is executed. See the next post:
Spark Source Code Analysis: The spark-submit Submission Process (Part 2)