Original author: Li Haiqiang, from the Retail Big Data team at Ping An Bank
Preface
As a data engineer, you have probably come across several ways to launch PySpark, and it may be unclear what these methods have in common, how they differ, and how the choice affects development and deployment. In this article we analyze these ways of launching PySpark.
The code analysis below is based on spark-2.4.4. To avoid ambiguity, be sure to follow along against this exact version of the Spark source.
Ways to launch PySpark
Code analysis of launching PySpark
Below we analyze the code path of each of the three methods in turn.
/path/to/spark-submit python_file.py
1. spark-submit is a shell script.
2. spark-submit invokes the shell command spark-class org.apache.spark.deploy.SparkSubmit python_file.py.
3. spark-class, at line 71, launches a JVM running org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit python_file.py, which rewrites the arguments for SparkSubmit:
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")
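The NUL-delimited handshake above can be sketched in Python (a minimal illustration written for this article, not Spark code): the launcher prints each argument terminated by a NUL byte, appends its own exit code, and the parent splits on NUL, pops the trailing status, and checks it before exec-ing the command.

```python
# Minimal sketch (illustration only, not Spark source) of the NUL-delimited
# protocol spark-class uses to read the command built by
# org.apache.spark.launcher.Main.

def encode_launcher_output(args, exit_code):
    """Mimic the launcher: each argument NUL-terminated, exit code appended."""
    out = b"".join(a.encode() + b"\0" for a in args)
    return out + b"%d\0" % exit_code

def decode_launcher_output(raw):
    """Mimic the while/read loop in spark-class: split on NUL, pop the
    trailing exit code, and fail if the launcher did not succeed."""
    parts = raw.split(b"\0")[:-1]          # drop the empty tail after the final NUL
    *cmd, status = [p.decode() for p in parts]
    if int(status) != 0:
        raise RuntimeError("launcher failed with exit code " + status)
    return cmd

raw = encode_launcher_output(
    ["java", "-cp", "path with spaces", "org.apache.spark.deploy.SparkSubmit",
     "python_file.py"], 0)
print(decode_launcher_output(raw))
```

NUL is used as the separator precisely so that arguments containing spaces or other shell-significant characters (like "path with spaces" above) survive the round trip intact.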
4. Digging deeper into how org.apache.spark.launcher.Main rewrites the SparkSubmit arguments, we can see that buildCommand handles three cases, corresponding to three different scenarios: the PySpark shell, the SparkR shell, and spark-submit. Each scenario maps to a different class:
/**
 * This constructor is used when invoking spark-submit; it parses and validates arguments
 * provided by the user on the command line.
 */
SparkSubmitCommandBuilder(List<String> args) {
  this.allowsMixedArguments = false;
  this.parsedArgs = new ArrayList<>();
  boolean isExample = false;
  List<String> submitArgs = args;
  this.userArgs = Collections.emptyList();

  if (args.size() > 0) {
    switch (args.get(0)) {
      case PYSPARK_SHELL:
        this.allowsMixedArguments = true;
        appResource = PYSPARK_SHELL;
        submitArgs = args.subList(1, args.size());
        break;

      case SPARKR_SHELL:
        this.allowsMixedArguments = true;
        appResource = SPARKR_SHELL;
        submitArgs = args.subList(1, args.size());
        break;

      case RUN_EXAMPLE:
        isExample = true;
        appResource = SparkLauncher.NO_RESOURCE;
        submitArgs = args.subList(1, args.size());
    }

    this.isExample = isExample;
    OptionParser parser = new OptionParser(true);
    parser.parse(submitArgs);
    this.isSpecialCommand = parser.isSpecialCommand;
  } else {
    this.isExample = isExample;
    this.isSpecialCommand = true;
  }
}

@Override
public List<String> buildCommand(Map<String, String> env)
    throws IOException, IllegalArgumentException {
  if (PYSPARK_SHELL.equals(appResource) && !isSpecialCommand) {
    return buildPySparkShellCommand(env);
  } else if (SPARKR_SHELL.equals(appResource) && !isSpecialCommand) {
    return buildSparkRCommand(env);
  } else {
    return buildSparkSubmitCommand(env);
  }
}
5. Here buildCommand returns the class org.apache.spark.deploy.SparkSubmit, with python_file.py as its argument.
6. Because SparkSubmit's argument is a .py file, SparkSubmit selects the class org.apache.spark.deploy.PythonRunner.
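The two dispatch decisions in this walkthrough can be condensed into a small Python sketch (an illustration written for this article, not Spark source): launcher.Main picks a builder path from the first argument, and SparkSubmit then picks a runner class from the primary resource's suffix. The marker strings below mirror Spark's constants but are reproduced here as assumptions.

```python
# Sketch (illustration only) of the dispatch logic described above.

PYSPARK_SHELL = "pyspark-shell-main"   # marker values assumed, see Spark source
SPARKR_SHELL = "sparkr-shell-main"

def build_command(args):
    """Mirror SparkSubmitCommandBuilder.buildCommand: the first argument
    decides which of the three build paths is taken."""
    first = args[0] if args else None
    if first == PYSPARK_SHELL:
        return "buildPySparkShellCommand"
    if first == SPARKR_SHELL:
        return "buildSparkRCommand"
    return "buildSparkSubmitCommand"

def choose_runner(app_resource):
    """Mirror SparkSubmit's choice of main class from the primary resource."""
    if app_resource.endswith(".py"):
        return "org.apache.spark.deploy.PythonRunner"
    if app_resource.endswith(".R"):
        return "org.apache.spark.deploy.RRunner"
    return "user-provided main class"

print(build_command(["python_file.py"]))   # falls through to spark-submit
print(choose_runner("python_file.py"))
```

For our command line the first argument is python_file.py, which matches none of the shell markers, so the spark-submit path is taken and the .py suffix routes execution to PythonRunner.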
Finally, let's look at the implementation of PythonRunner: it first creates a py4j.GatewayServer thread to receive requests from the Python side, and then starts a child process to execute the user's Python code python_file.py.
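PythonRunner's two steps can be approximated in Python (a rough analogue for illustration; the real PythonRunner is Scala and uses py4j): start a server in a background thread standing in for py4j.GatewayServer, hand its port to the child through an environment variable (Spark uses PYSPARK_GATEWAY_PORT; the name GATEWAY_PORT below is our stand-in), then run the user's script in a child process and wait for it.

```python
# Rough analogue (illustration only, not Spark code) of PythonRunner:
# a gateway thread in the parent, user code in a child process.
import os
import socket
import subprocess
import sys
import threading

# Step 1: stand-in for the py4j.GatewayServer thread. It accepts one
# connection from the child and answers it, showing the parent serving
# requests initiated by the Python side.
server = socket.socket()
server.bind(("127.0.0.1", 0))          # OS-assigned port, as py4j does
server.listen(1)
port = server.getsockname()[1]

def gateway():
    conn, _ = server.accept()
    conn.sendall(b"gateway-ok")
    conn.close()

threading.Thread(target=gateway, daemon=True).start()

# Step 2: spawn a child process for the "user code". The gateway port is
# passed via an environment variable (GATEWAY_PORT is our stand-in name).
child_code = (
    "import os, socket\n"
    "s = socket.create_connection(('127.0.0.1', int(os.environ['GATEWAY_PORT'])))\n"
    "print(s.recv(16).decode())\n"
)
env = dict(os.environ, GATEWAY_PORT=str(port))
out = subprocess.run([sys.executable, "-c", child_code],
                     env=env, capture_output=True, text=True)
print(out.stdout.strip())
```

The direction of the connection matches the description above: the parent listens, and it is the Python child that initiates requests back to the gateway.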