Original author: Li Haiqiang, from the Retail Big Data team at Ping An Bank
Preface
As a data engineer, you have probably come across several ways to launch PySpark, and it may be unclear what these methods have in common, how they differ, and how the choice affects development and deployment. In this article we analyze these ways of launching PySpark.
The code analysis below is based on spark-2.4.4. To avoid ambiguity, be sure to follow along against this exact version of the Spark source.
Ways to launch PySpark
Code analysis of launching PySpark
Below we analyze the code path of each of the three methods in turn.
/path/to/spark-submit python_file.py
1. spark-submit is a shell script.
2. spark-submit invokes the shell command spark-class org.apache.spark.deploy.SparkSubmit python_file.py.
3. spark-class, at line 71, launches a JVM running org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit python_file.py, which rewrites the arguments for SparkSubmit:
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")
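The NUL-delimited handshake above can be sketched in Python (a minimal illustration written for this article, not Spark code): the launcher prints each argument terminated by a NUL byte, appends its own exit code, and the parent splits on NUL, pops the trailing status, and checks it before exec-ing the command.

```python
# Minimal sketch (illustration only, not Spark source) of the NUL-delimited
# protocol spark-class uses to read the command built by
# org.apache.spark.launcher.Main.

def encode_launcher_output(args, exit_code):
    """Mimic the launcher: each argument NUL-terminated, exit code appended."""
    out = b"".join(a.encode() + b"\0" for a in args)
    return out + b"%d\0" % exit_code

def decode_launcher_output(raw):
    """Mimic the while/read loop in spark-class: split on NUL, pop the
    trailing exit code, and fail if the launcher did not succeed."""
    parts = raw.split(b"\0")[:-1]          # drop the empty tail after the final NUL
    *cmd, status = [p.decode() for p in parts]
    if int(status) != 0:
        raise RuntimeError("launcher failed with exit code " + status)
    return cmd

raw = encode_launcher_output(
    ["java", "-cp", "path with spaces", "org.apache.spark.deploy.SparkSubmit",
     "python_file.py"], 0)
print(decode_launcher_output(raw))
```

NUL is used as the separator precisely so that arguments containing spaces or other shell-significant characters (like "path with spaces" above) survive the round trip intact.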
4. Digging deeper into how org.apache.spark.launcher.Main rewrites the SparkSubmit arguments, we can see that buildCommand handles three cases, corresponding to three different scenarios: the PySpark shell, the SparkR shell, and spark-submit. Each scenario maps to a different class:
/**
 * This constructor is used when invoking spark-submit; it parses and validates arguments
 * provided by the user on the command line.
 */
SparkSubmitCommandBuilder(List<String> args) {
  this.allowsMixedArguments = false;
  this.parsedArgs = new ArrayList<>();
  boolean isExample = false;
  List<String> submitArgs = args;
  this.userArgs = Collections.emptyList();

  if (args.size() > 0) {
    switch (args.get(0)) {
      case PYSPARK_SHELL:
        this.allowsMixedArguments = true;
        appResource = PYSPARK_SHELL;
        submitArgs = args.subList(1, args.size());
        break;

      case SPARKR_SHELL:
        this.allowsMixedArguments = true;
        appResource = SPARKR_SHELL;
        submitArgs = args.subList(1, args.size());
        break;

      case RUN_EXAMPLE:
        isExample = true;
        appResource = SparkLauncher.NO_RESOURCE;
        submitArgs = args.subList(1, args.size());
    }

    this.isExample = isExample;
    OptionParser parser = new OptionParser(true);
    parser.parse(submitArgs);
    this.isSpecialCommand = parser.isSpecialCommand;
  } else {
    this.isExample = isExample;
    this.isSpecialCommand = true;
  }
}

@Override
public List<String> buildCommand(Map<String, String> env)
    throws IOException, IllegalArgumentException {
  if (PYSPARK_SHELL.equals(appResource) && !isSpecialCommand) {
    return buildPySparkShellCommand(env);
  } else if (SPARKR_SHELL.equals(appResource) && !isSpecialCommand) {
    return buildSparkRCommand(env);
  } else {
    return buildSparkSubmitCommand(env);
  }
}
5. Here buildCommand returns the class org.apache.spark.deploy.SparkSubmit, with python_file.py as its argument.
6. Because SparkSubmit's argument is a .py file, SparkSubmit selects the class org.apache.spark.deploy.PythonRunner.
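The two dispatch decisions in this walkthrough can be condensed into a small Python sketch (an illustration written for this article, not Spark source): launcher.Main picks a builder path from the first argument, and SparkSubmit then picks a runner class from the primary resource's suffix. The marker strings below mirror Spark's constants but are reproduced here as assumptions.

```python
# Sketch (illustration only) of the dispatch logic described above.

PYSPARK_SHELL = "pyspark-shell-main"   # marker values assumed, see Spark source
SPARKR_SHELL = "sparkr-shell-main"

def build_command(args):
    """Mirror SparkSubmitCommandBuilder.buildCommand: the first argument
    decides which of the three build paths is taken."""
    first = args[0] if args else None
    if first == PYSPARK_SHELL:
        return "buildPySparkShellCommand"
    if first == SPARKR_SHELL:
        return "buildSparkRCommand"
    return "buildSparkSubmitCommand"

def choose_runner(app_resource):
    """Mirror SparkSubmit's choice of main class from the primary resource."""
    if app_resource.endswith(".py"):
        return "org.apache.spark.deploy.PythonRunner"
    if app_resource.endswith(".R"):
        return "org.apache.spark.deploy.RRunner"
    return "user-provided main class"

print(build_command(["python_file.py"]))   # falls through to spark-submit
print(choose_runner("python_file.py"))
```

For our command line the first argument is python_file.py, which matches none of the shell markers, so the spark-submit path is taken and the .py suffix routes execution to PythonRunner.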
Finally, let's look at the implementation of PythonRunner: it first creates a py4j.GatewayServer thread to receive requests from the Python side, and then starts a child process to execute the user's Python code python_file.py.
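PythonRunner's two steps can be approximated in Python (a rough analogue for illustration; the real PythonRunner is Scala and uses py4j): start a server in a background thread standing in for py4j.GatewayServer, hand its port to the child through an environment variable (Spark uses PYSPARK_GATEWAY_PORT; the name GATEWAY_PORT below is our stand-in), then run the user's script in a child process and wait for it.

```python
# Rough analogue (illustration only, not Spark code) of PythonRunner:
# a gateway thread in the parent, user code in a child process.
import os
import socket
import subprocess
import sys
import threading

# Step 1: stand-in for the py4j.GatewayServer thread. It accepts one
# connection from the child and answers it, showing the parent serving
# requests initiated by the Python side.
server = socket.socket()
server.bind(("127.0.0.1", 0))          # OS-assigned port, as py4j does
server.listen(1)
port = server.getsockname()[1]

def gateway():
    conn, _ = server.accept()
    conn.sendall(b"gateway-ok")
    conn.close()

threading.Thread(target=gateway, daemon=True).start()

# Step 2: spawn a child process for the "user code". The gateway port is
# passed via an environment variable (GATEWAY_PORT is our stand-in name).
child_code = (
    "import os, socket\n"
    "s = socket.create_connection(('127.0.0.1', int(os.environ['GATEWAY_PORT'])))\n"
    "print(s.recv(16).decode())\n"
)
env = dict(os.environ, GATEWAY_PORT=str(port))
out = subprocess.run([sys.executable, "-c", child_code],
                     env=env, capture_output=True, text=True)
print(out.stdout.strip())
```

The direction of the connection matches the description above: the parent listens, and it is the Python child that initiates requests back to the gateway.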