Spark Source Code Analysis: The spark-submit Submission Process
- 1. Suppose we submit the following command to the cluster:
bin/spark-submit \
--class com.wt.spark.WordCount \
--master yarn \
WordCount.jar \
/input \
/output
-
2. The launch script calls spark-submit, so let's look at the spark-submit script first.
# -z tests whether the variable that follows is empty (true if empty); variables are expanded inside double quotes, not single quotes
# This step checks whether SPARK_HOME is empty; if it is, the then-branch runs
# source filename reads and executes the commands in filename in the current bash environment
# $0 is the name of the script file itself, here spark-submit
# dirname returns the directory part of a path, so dirname "$0" is the directory containing this script
# $(command) substitutes the output of the command
# So the whole if statement means: if SPARK_HOME is not set, run find-spark-home in the script's directory to set it
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Run the spark-class script, passing org.apache.spark.deploy.SparkSubmit and "$@"
# where "$@" is every argument that spark-submit itself received
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
-
At the end of the spark-submit script, the following line is executed:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
- So the overall logic of the spark-submit script is:
- first, check whether SPARK_HOME is set;
- if it is not set, source find-spark-home to resolve and export it;
- then exec the spark-class script, passing org.apache.spark.deploy.SparkSubmit together with all of the original arguments.
-
-
The find-spark-home script
# Path of the python script that can determine SPARK_HOME; used below to check whether that script exists
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"

# Short circuit if the user already has this set.
# If SPARK_HOME is already non-empty, exit successfully
if [ ! -z "${SPARK_HOME}" ]; then
  exit 0
# -f tests whether the path exists and is a regular file; if find_spark_home.py is not there,
# derive SPARK_HOME from the parent directory of this script
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
  # need to search the different Python directories for a Spark installation.
  # Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
  # spark-submit in another directory we want to use that version of PySpark rather than the
  # pip installed version of PySpark.
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
  # Default to standard python interpreter unless told otherwise
  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
     PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
  fi
  export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
fi
- As we can see, if the user has not set SPARK_HOME beforehand, this script resolves it automatically and exports it as an environment variable for the scripts that follow.
- Once SPARK_HOME is set, the spark-class script is executed. This is the key part of our analysis; its source is as follows:
-
The spark-class script
#!/usr/bin/env bash

# As before, make sure SPARK_HOME is set
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Run load-spark-env.sh, whose main purpose is to load and export some variables:
# the settings from spark-env.sh, and the Scala version variable
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
# Determine the java executable to use
# -n tests whether the variable's length is non-zero
# If Java is installed but JAVA_HOME is not set, command -v java locates the java binary on the PATH
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
# -d tests whether the path is a directory
# Locate the directory containing Spark's jars, used to build the launch classpath
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
# Run the class org.apache.spark.launcher.Main, which prints the fully resolved launch command
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
# Collect the arguments produced by build_command into the CMD array
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
# Exec the assembled command, whose main class is org.apache.spark.deploy.SparkSubmit
exec "${CMD[@]}"
- The execution flow of spark-class is a bit more involved; overall it is:
- check (and if necessary set) SPARK_HOME;
- source load-spark-env.sh, which sets environment variables needed later, such as the Scala version, and in turn loads spark-env.sh;
- determine the java executable to use (RUNNER);
- locate the Spark jars and build the launch classpath;
- run org.apache.spark.launcher.Main, which prints the fully resolved launch command, collected into the CMD array;
- check whether the launcher exited successfully (i.e. whether the user-supplied arguments were valid);
- if so, exec the command in CMD, whose main class is org.apache.spark.deploy.SparkSubmit.
-
3. Execute the org.apache.spark.deploy.SparkSubmit class
def main(args: Array[String]): Unit = {
  // Parse the arguments passed in by the submit script
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  // Dispatch to the corresponding action based on the parsed arguments
  appArgs.action match {
    // Submit the application
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    // Only triggered in standalone and Mesos cluster modes
    case SparkSubmitAction.KILL => kill(appArgs)
    // Only triggered in standalone and Mesos cluster modes
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
-
1. Call val appArgs = new SparkSubmitArguments(args)
Building the argument object involves three main steps: 1. parse the command line, 2. load environment variables and default properties into memory, 3. validate the arguments.
// Set parameters from command line arguments
try {
  parse(args.asJava)
} catch {
  case e: IllegalArgumentException =>
    SparkSubmit.printErrorAndExit(e.getMessage())
}
// Load the default properties, i.e. the values from spark-defaults.conf
mergeDefaultSparkProperties()
// Remove keys that don't start with "spark." from `sparkProperties`.
// Ignore invalid Spark properties, i.e. keys that do not start with "spark."
ignoreNonSparkProperties()
// Use `sparkProperties` map along with env vars to fill in any missing parameters
// Fill in missing values from the environment: master, extra classes, cores, memory, mainClass, etc.
loadEnvironmentArguments()
// Validate the arguments
validateArguments()
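To make these three steps concrete, here is a rough, purely illustrative sketch of the values the parsed argument object ends up holding for the WordCount submission from the top of this post. The field names mirror those in SparkSubmitArguments, but the snippet itself is just plain vals, not the real class:

// Illustrative sketch only: roughly what SparkSubmitArguments resolves for
// "bin/spark-submit --class com.wt.spark.WordCount --master yarn WordCount.jar /input /output".
// Exact values depend on your environment and on spark-defaults.conf.
object ParsedArgsSketch {
  val master          = "yarn"                      // from --master
  val mainClass       = "com.wt.spark.WordCount"    // from --class
  val primaryResource = "WordCount.jar"             // the first positional argument (the user jar)
  val childArgs       = Seq("/input", "/output")    // remaining arguments, passed to the user's main()
  // action defaults to SUBMIT when neither --kill nor --status is given

  def main(args: Array[String]): Unit =
    println(s"master=$master mainClass=$mainClass resource=$primaryResource args=$childArgs")
}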
-
2. Execute submit(appArgs)
private def submit(args: SparkSubmitArguments): Unit = {
  // Prepare the submission environment
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception =>
          // Hadoop's AuthorizationException suppresses the exception's stack trace, which
          // makes the message printed to the output by the JVM not very helpful. Instead,
          // detect exceptions with empty stack traces here, and treat them differently.
          if (e.getStackTrace().length == 0) {
            // scalastyle:off println
            printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            // scalastyle:on println
            exitFn(1)
          } else {
            throw e
          }
      }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }
-
In standalone cluster mode there are two submission gateways, selected by args.useRest:
- the legacy RPC gateway, implemented by o.a.s.deploy.Client (args.useRest = false)
- the new REST-based gateway introduced in Spark 1.3 (args.useRest = true, the default)
Since Spark 1.3 the REST-based gateway is used by default, but if the REST submission fails,
spark-submit fails over to the legacy gateway, as sketched below.
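The part of submit() that is not quoted above implements this fail-over. Below is a minimal, self-contained sketch of the control flow; the names (e.g. SubmitRestConnectionException, submitViaRest) only imitate the real ones and are not the actual Spark code:

// Minimal, self-contained sketch of the standalone-cluster gateway fail-over pattern.
object GatewayFailoverSketch {
  final class SubmitRestConnectionException(msg: String) extends Exception(msg)

  // Stand-in for the REST-based gateway; here it always fails, as if the master
  // were an older, non-REST master.
  def submitViaRest(): Unit =
    throw new SubmitRestConnectionException("master endpoint is not a REST server")

  // Stand-in for the legacy gateway implemented by o.a.s.deploy.Client
  def submitViaLegacyRpc(): Unit =
    println("submitted through the legacy RPC gateway")

  def submit(useRest: Boolean): Unit =
    if (useRest) {
      try submitViaRest()
      catch {
        case e: SubmitRestConnectionException =>
          println(s"REST submission failed (${e.getMessage}); falling back to the legacy gateway")
          submit(useRest = false) // retry once with the legacy gateway
      }
    } else {
      submitViaLegacyRpc()
    }

  def main(args: Array[String]): Unit = submit(useRest = true)
}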
-
4. Execute the runMain() method
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  // scalastyle:off println
  if (verbose) {
    printStream.println(s"Main class:\n$childMainClass")
    printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
    printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
    printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
    printStream.println("\n")
  }
  // scalastyle:on println

  // 1. Build the class loader and install it on the current thread
  val loader =
    if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)

  // Add the jars on the child classpath to the class loader
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }

  // Apply the resource/configuration settings as system properties
  for ((key, value) <- sysProps) {
    System.setProperty(key, value)
  }

  var mainClass: Class[_] = null

  try {
    // Load the class named by childMainClass via reflection; step into classForName to see how
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: ClassNotFoundException =>
      e.printStackTrace(printStream)
      if (childMainClass.contains("thriftserver")) {
        // scalastyle:off println
        printStream.println(s"Failed to load main class $childMainClass.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        // scalastyle:on println
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    case e: NoClassDefFoundError =>
      e.printStackTrace(printStream)
      if (e.getMessage.contains("org/apache/hadoop/hive")) {
        // scalastyle:off println
        printStream.println(s"Failed to load hive class.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
        // scalastyle:on println
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }

  // SPARK-4170
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }

  // From the class object obtained via reflection, look up the main method
  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }

  @tailrec
  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: InvocationTargetException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: Throwable =>
      e
  }

  try {
    // Invoke the main method of the resolved class
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}
-
So the question becomes: what is childMainClass?
-
childMainClass first appears in the SparkSubmit class, as one of the values returned by prepareSubmitEnvironment:
private def submit(args: SparkSubmitArguments): Unit = {
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
  ...
}
-
Let's step into the prepareSubmitEnvironment(args) method:
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
    : (Seq[String], Seq[String], Map[String, String], String) = {
  // Return values
  val childArgs = new ArrayBuffer[String]()
  val childClasspath = new ArrayBuffer[String]()
  val sysProps = new HashMap[String, String]()
  var childMainClass = ""
  ...
  if (deployMode == CLIENT || isYarnCluster) {
    childMainClass = args.mainClass
    if (isUserJar(args.primaryResource)) {
      childClasspath += args.primaryResource
    }
    if (args.jars != null) { childClasspath ++= args.jars.split(",") }
  }
  ...
  if (isYarnCluster) {
    childMainClass = "org.apache.spark.deploy.yarn.Client"
    if (args.isPython) {
      childArgs += ("--primary-py-file", args.primaryResource)
      childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
    } else if (args.isR) {
      val mainFile = new Path(args.primaryResource).getName
      childArgs += ("--primary-r-file", mainFile)
      childArgs += ("--class", "org.apache.spark.deploy.RRunner")
    } else {
      if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
        childArgs += ("--jar", args.primaryResource)
      }
      childArgs += ("--class", args.mainClass)
    }
    if (args.childArgs != null) {
      args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
    }
  }
  ...
  (childArgs, childClasspath, sysProps, childMainClass)
}
-
So now we know what childMainClass is in a YARN environment:
// yarn cluster mode: childMainClass = "org.apache.spark.deploy.yarn.Client"
// yarn client mode:  childMainClass = "com.wt.spark.WordCount"
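Applying the yarn-cluster branch quoted above to the WordCount submission from the beginning of this post, the prepared values would look roughly like this (illustrative only; the real values also depend on resolved paths and the Spark version):

// Roughly what prepareSubmitEnvironment returns for the example WordCount submission
// in yarn cluster mode (sketch, not actual Spark output).
object PreparedEnvSketch {
  val childMainClass = "org.apache.spark.deploy.yarn.Client"
  val childArgs = Seq(
    "--jar",   "WordCount.jar",           // args.primaryResource
    "--class", "com.wt.spark.WordCount",  // args.mainClass
    "--arg",   "/input",                  // each application argument becomes its own --arg pair
    "--arg",   "/output"
  )

  def main(args: Array[String]): Unit =
    println(s"childMainClass=$childMainClass childArgs=$childArgs")
}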
-
With that, let's go back to the line above:
mainClass = Utils.classForName(childMainClass)
The class loaded here through reflection is therefore, in yarn cluster mode, org.apache.spark.deploy.yarn.Client.
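The mechanism is plain Java reflection: load the class by name, look up its static main(String[]) method, and invoke it. Here is a minimal, self-contained sketch of that mechanism; Demo and its arguments are made up for illustration and are not part of the Spark source:

import java.lang.reflect.Modifier

// A stand-in application class; in the real flow this would be
// org.apache.spark.deploy.yarn.Client (cluster mode) or the user's main class (client mode).
object Demo {
  def main(args: Array[String]): Unit = println("Demo.main called with: " + args.mkString(", "))
}

object ReflectiveMainSketch {
  def runMain(mainClassName: String, childArgs: Array[String]): Unit = {
    // Load the class by name, as Utils.classForName does with childMainClass
    val mainClass = Class.forName(mainClassName)
    // Look up the static main(String[]) entry point
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers),
      "The main method in the given main class must be static")
    // Static method, so the receiver is null; the String[] is passed as the single argument
    mainMethod.invoke(null, childArgs)
  }

  def main(args: Array[String]): Unit = runMain("Demo", Array("/input", "/output"))
}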
-
-
5. Next, mainMethod.invoke(null, childArgs.toArray) is executed. See the next post:
Spark Source Code Analysis: The spark-submit Submission Process (Part 2)