1. Configuring spark-shell for remote debugging
spark-shell --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888"
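Here -Xdebug and -Xrunjdwp are the legacy JDWP debugging options: transport=dt_socket makes the JVM listen on a socket, server=y means the JVM acts as the debug server and waits for a debugger to attach, suspend=y pauses the JVM until the debugger connects (so breakpoints in startup code can be hit), and address=8888 is the listening port. On Java 5 and later the equivalent single option is:
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888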
2. Attaching to the remote JVM from IDEA
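In IDEA, create a Run/Debug configuration of type Remote (called "Remote JVM Debug" in recent versions), set the host to the machine running spark-shell and the port to 8888 from the command above, then start it in Debug mode. Once IDEA attaches, the suspended JVM resumes and breakpoints set in the Spark source will be hit.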
3. The spark-shell launch scripts
Three scripts are involved in total:
The spark-shell script:
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}
As you can see, the spark-shell script delegates to the spark-submit script, passing org.apache.spark.repl.Main as the --class to run.
The spark-submit script:
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi
# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
The spark-submit script in turn execs the spark-class script, passing org.apache.spark.deploy.SparkSubmit as the class to run.
The spark-class script (key excerpt):
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  # echo "$RUNNER -Xmx128m -cp $LAUNCH_CLASSPATH org.apache.spark.launcher.Main $@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
The spark-class script is fairly long; only the key part is shown above.
The crucial line is inside build_command: "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@". The entry class is therefore org.apache.spark.launcher.Main: it assembles the final java command and prints its arguments to stdout separated by NUL characters, and build_command appends its exit code with printf "%d\0" $?. The while loop then reads this NUL-delimited stream into the CMD array, the last element is validated as the launcher's exit code, and the remaining elements are exec'd as the actual JVM command.
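For a plain spark-shell launch, the command that spark-class ends up exec'ing looks roughly like the following (an illustration only; the java path, classpath, and memory settings depend on your installation):

/usr/lib/jvm/java-8-openjdk/bin/java -cp "${SPARK_HOME}/conf:${SPARK_HOME}/jars/*" -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name "Spark shell" spark-shell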
4. Walking through the source
We start with the SparkSubmit class, because whether you submit your own hand-written Spark application or run spark-shell, everything goes through spark-submit.
override def main(args: Array[String]): Unit = {
  val submit = new SparkSubmit() {
    self =>

    override protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
      new SparkSubmitArguments(args) {
        override protected def logInfo(msg: => String): Unit = self.logInfo(msg)
        override protected def logWarning(msg: => String): Unit = self.logWarning(msg)
      }
    }

    override protected def logInfo(msg: => String): Unit = printMessage(msg)
    override protected def logWarning(msg: => String): Unit = printMessage(s"Warning: $msg")

    override def doSubmit(args: Array[String]): Unit = {
      try {
        super.doSubmit(args)
      } catch {
        case e: SparkUserAppException =>
          exitFn(e.exitCode)
      }
    }
  }

  submit.doSubmit(args)
}
As you can see, main constructs an anonymous SparkSubmit subclass that overrides doSubmit (plus the logging hooks), and then calls doSubmit to submit the job. Let's look at what args contains at this point.
It holds the job's command-line parameters; the --class entry, for instance, is the one passed in by the spark-shell script.
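For a spark-shell launch, judging from the spark-shell script shown earlier, args looks roughly like this ("spark-shell" being the placeholder primary resource the launcher fills in for the REPL):

Array("--class", "org.apache.spark.repl.Main", "--name", "Spark shell", "spark-shell")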
def doSubmit(args: Array[String]): Unit = {
  // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
  // be reset before the application starts.
  val uninitLog = initializeLogIfNecessary(true, silent = true)

  val appArgs = parseArguments(args)
  if (appArgs.verbose) {
    logInfo(appArgs.toString)
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    case SparkSubmitAction.PRINT_VERSION => printVersion()
  }
}
In doSubmit, the first step is to parse the arguments; the resulting action is then pattern-matched, here matching SparkSubmitAction.SUBMIT, which invokes the submit method. Let's first look at how argument parsing works, starting with the SparkSubmitArguments class:
private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
  extends SparkSubmitArgumentsParser with Logging {
  // ... configuration fields
  var master: String = null
  var deployMode: String = null
  var executorMemory: String = null
  var executorCores: String = null
  var totalExecutorCores: String = null
  var propertiesFile: String = null
  var driverMemory: String = null
  // ...
  var numExecutors: String = null
  var files: String = null
  var archives: String = null
  var mainClass: String = null
  var primaryResource: String = null
  var name: String = null
  var childArgs: ArrayBuffer[String] = new ArrayBuffer[String]()
  var jars: String = null
  var packages: String = null
  // ...
  private var dynamicAllocationEnabled: Boolean = false
  // Standalone cluster mode only
  var supervise: Boolean = false
  var driverCores: String = null
  var submissionToKill: String = null
  var submissionToRequestStatusFor: String = null
  var useRest: Boolean = false // used internally

  /** Default properties present in the currently defined defaults file. */
  lazy val defaultSparkProperties: HashMap[String, String] = {
    val defaultProperties = new HashMap[String, String]()
    if (verbose) {
      logInfo(s"Using properties file: $propertiesFile")
    }
    Option(propertiesFile).foreach { filename =>
      val properties = Utils.getPropertiesFromFile(filename)
      properties.foreach { case (k, v) =>
        defaultProperties(k) = v
      }
      // Property files may contain sensitive information, so redact before printing
      if (verbose) {
        Utils.redact(properties).foreach { case (k, v) =>
          logInfo(s"Adding default property: $k=$v")
        }
      }
    }
    defaultProperties
  }

  // Entry point for argument parsing
  parse(args.asJava)

  // Populate `sparkProperties` map from properties file
  mergeDefaultSparkProperties()
  // Remove keys that don't start with "spark." from `sparkProperties`.
  ignoreNonSparkProperties()
  // Use `sparkProperties` map along with env vars to fill in any missing parameters
  loadEnvironmentArguments()

  useRest = sparkProperties.getOrElse("spark.master.rest.enabled", "false").toBoolean

  validateArguments()
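The parse call lands in the launcher module: SparkSubmitArgumentsParser extends org.apache.spark.launcher.SparkSubmitOptionParser, whose parse method is written in Java: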
protected final void parse(List<String> args) {
  // Regex for options written in the --key=value form
  Pattern eqSeparatedOpt = Pattern.compile("(--[^=]+)=(.+)");

  int idx = 0;
  for (idx = 0; idx < args.size(); idx++) {
    // e.g. for spark-shell: arg = "--class", followed by the separate token "org.apache.spark.repl.Main"
    String arg = args.get(idx);
    String value = null;

    Matcher m = eqSeparatedOpt.matcher(arg);
    if (m.matches()) {
      // Only taken for the --key=value form (e.g. --name=test => arg = "--name", value = "test").
      // "--class" contains no '=', so the matcher does not match and value stays null.
      arg = m.group(1);
      value = m.group(2);
    }

    // Look for options with a value.
    String name = findCliOption(arg, opts);
    if (name != null) {
      if (value == null) {
        if (idx == args.size() - 1) {
          throw new IllegalArgumentException(
              String.format("Missing argument for option '%s'.", arg));
        }
        idx++;
        // The value is the next token, e.g. value = "org.apache.spark.repl.Main"
        value = args.get(idx);
      }
      if (!handle(name, value)) {
        break;
      }
      continue;
    }

    // Look for a switch.
    name = findCliOption(arg, switches);
    if (name != null) {
      if (!handle(name, null)) {
        break;
      }
      continue;
    }

    if (!handleUnknown(arg)) {
      break;
    }
  }

  if (idx < args.size()) {
    idx++;
  }
  handleExtraArgs(args.subList(idx, args.size()));
}
opts is a two-dimensional array defined in SparkSubmitOptionParser that lists all of the recognized command-line options (each row holds an option name together with its aliases);
the findCliOption method simply checks whether the argument we passed in is one of these predefined option keys.
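As a quick illustration of the two option forms the parser accepts, here is a minimal, self-contained Scala sketch (my own demo, not Spark code) of the same regex logic:

object EqOptDemo {
  // Same pattern as eqSeparatedOpt in SparkSubmitOptionParser.parse
  private val EqSeparatedOpt = "(--[^=]+)=(.+)".r

  // Returns the option name, plus the value if it was attached with '='
  def split(arg: String): (String, Option[String]) = arg match {
    case EqSeparatedOpt(name, value) => (name, Some(value)) // --key=value form
    case _ => (arg, None) // plain --key form: the value must come from the next token
  }

  def main(args: Array[String]): Unit = {
    println(split("--name=Spark shell")) // (--name,Some(Spark shell))
    println(split("--class"))            // (--class,None)
  }
}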
Once the arguments are parsed, the next and most important step is the submit method, which actually submits the job.
Inside submit, a local doRunMain function is defined and invoked at the end.
doRunMain in turn calls runMain, which loads the application class and calls its start method.
This is the real entry point of spark-shell: here klass = org.apache.spark.repl.Main, and since that is a plain class with a main method, it gets wrapped in a JavaMainApplication whose start invokes that main function via reflection.
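The reflection call itself is straightforward. Below is a minimal runnable Scala sketch of the same mechanism (simplified; Spark's real code goes through SparkApplication/JavaMainApplication, and running this demo assumes spark-repl is on the classpath):

import java.lang.reflect.Modifier

object ReflectMainDemo {
  def main(cmdArgs: Array[String]): Unit = {
    // Load the class by name, just as runMain does with the --class value
    val klass = Class.forName("org.apache.spark.repl.Main")
    val mainMethod = klass.getMethod("main", classOf[Array[String]])
    // JavaMainApplication also verifies that main is static before invoking it
    require(Modifier.isStatic(mainMethod.getModifiers), "main must be static")
    // null receiver because main is static; cmdArgs becomes the String[] argument
    mainMethod.invoke(null, cmdArgs)
  }
}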
Inside org.apache.spark.repl.Main, a new SparkILoop is created; as the SparkILoop starts up, the crucial initialization happens: the SparkSession, and with it the SparkContext, is created.
The multi-line string there contains the initialization logic for sc; the REPL interprets it before handing the prompt to the user, and that is where the sc we use in spark-shell comes from.
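For reference, here is what that multi-line string contains (abridged from Spark 2.x's SparkILoop.initializeSpark; the Web UI URL printing is elided):

def initializeSpark() {
  intp.beQuietDuring {
    processLine("""
      @transient val spark = if (org.apache.spark.repl.Main.sparkSession != null) {
          org.apache.spark.repl.Main.sparkSession
        } else {
          org.apache.spark.repl.Main.createSparkSession()
        }
      @transient val sc = {
        val _sc = spark.sparkContext
        // (Web UI URL printing elided)
        println("Spark context available as 'sc' " +
          "(master = %s, app id = %s).".format(_sc.master, _sc.applicationId))
        println("Spark session available as 'spark'.")
        _sc
      }
      """)
    processLine("import org.apache.spark.SparkContext._")
    processLine("import spark.implicits._")
    processLine("import spark.sql")
  }
}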