Whether you submit a Spark job from spark-shell or run the spark-submit script from the command line of some client machine (for instance, spark-submit --class com.example.MyApp myapp.jar), what actually executes is the spark-submit script.
The spark-submit script:
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
This is straightforward: it checks whether ${SPARK_HOME} is set and goes looking for it if not, then invokes spark-class to run SparkSubmit. Now let's look at the spark-class script, keeping only the important fragments:
# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi
…
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}
…
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")
…
exec "${CMD[@]}"
spark-class uses the java binary to run a main function, passing the SparkSubmit class name and all the original arguments through to it. The job of this main is to build, from its input, the final command to execute; note that build_command emits that command as NUL-delimited strings followed by its exit code, and the while loop reads them back into the CMD array. Let's look at the flow of the main function of org.apache.spark.launcher.Main:
public static void main(String[] argsArray) throws Exception {
  checkArgument(argsArray.length > 0, "Not enough arguments: missing class name.");

  List<String> args = new ArrayList<>(Arrays.asList(argsArray));
  String className = args.remove(0);

  boolean printLaunchCommand = !isEmpty(System.getenv("SPARK_PRINT_LAUNCH_COMMAND"));

  AbstractCommandBuilder builder;
  if (className.equals("org.apache.spark.deploy.SparkSubmit")) {
    try {
      builder = new SparkSubmitCommandBuilder(args);
    } catch (IllegalArgumentException e) {
      …
    }
  }

  List<String> cmd = builder.buildCommand(env);
  ...
}
Since the first argument is org.apache.spark.deploy.SparkSubmit, the remaining arguments are handed to SparkSubmitCommandBuilder. (The printLaunchCommand flag above is worth noting: exporting SPARK_PRINT_LAUNCH_COMMAND with a non-empty value makes the launcher print the final command it builds, which is handy when debugging this whole chain.)
@Override
public List<String> buildCommand(Map<String, String> env)
    throws IOException, IllegalArgumentException {
  if (PYSPARK_SHELL.equals(appResource) && isAppResourceReq) {
    return buildPySparkShellCommand(env);
  } else if (SPARKR_SHELL.equals(appResource) && isAppResourceReq) {
    return buildSparkRCommand(env);
  } else {
    return buildSparkSubmitCommand(env);
  }
}
Depending on the scenario (PySpark shell, SparkR shell, or an ordinary submit), buildCommand produces a different command string, which the spark-class script finally executes; see the exec on the script's last line. A sketch of the typical result follows.
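The exact command varies with environment and configuration. As a hedged illustration only (the java path, classpath, memory setting, and the user class com.example.MyApp are all hypothetical), the list returned for an ordinary submit has roughly this shape, written here as a Scala value:

// Illustrative sketch, not real launcher output: every concrete value is made up.
val cmd: Seq[String] = Seq(
  "/usr/lib/jvm/java-8-openjdk/bin/java",      // the RUNNER that spark-class located
  "-cp", "/opt/spark/conf:/opt/spark/jars/*",  // classpath assembled by the launcher
  "-Xmx1g",                                    // driver memory
  "org.apache.spark.deploy.SparkSubmit",       // the class spark-class was asked to run
  "--class", "com.example.MyApp",              // the user's arguments follow unchanged
  "/path/to/myapp.jar"
)

spark-class reads these strings into CMD and execs them, replacing the shell process with this JVM.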
Whichever branch is taken, the command ultimately runs the main function of org.apache.spark.deploy.SparkSubmit:
override def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
SparkSubmitArguments parses the command-line flags; action defaults to SUBMIT, while flags such as --kill and --status (for standalone cluster mode) select the other branches. Let's step into the submit flow. It has two main steps. First, it parses the args to determine the class to run, the classpath, and the other parameters:
val (childArgs, childClasspath, sysProps, childMainClass) =
  prepareSubmitEnvironment(args)
Then it calls:
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
runMain uses Java reflection to look up the main method of childMainClass and invoke it. childMainClass is the application class supplied when the job was submitted (at least in client mode; in cluster mode Spark may substitute a cluster-manager-specific wrapper class instead).
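To make that step concrete, here is a minimal sketch of the reflection involved, assuming hypothetical stand-ins for childMainClass and childArgs; it mirrors the technique, not Spark's verbatim code:

object ReflectiveLaunch {
  def main(sysArgs: Array[String]): Unit = {
    // Hypothetical stand-ins for the values prepareSubmitEnvironment returns
    val childMainClass = "com.example.MyApp"
    val childArgs = Seq("input.txt", "output")

    // Load the application class and look up its static main(String[])
    val loader = Thread.currentThread.getContextClassLoader
    val mainClass = Class.forName(childMainClass, true, loader)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])

    // Invoke it; the receiver is null because main is static, and Scala
    // passes the array as a single reflective argument rather than spreading it
    mainMethod.invoke(null, childArgs.toArray)
  }
}

In client mode this means the application's main runs inside the very JVM that spark-class started; a wrong --class value would surface here as a ClassNotFoundException or NoSuchMethodException.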