Spark 2.3 Source Code Reading - Spark Startup Scripts

I have been reading source code lately and found that there is still relatively little written about the Spark 2.3 source, so I decided to try writing about it myself.

Let's start with the startup scripts. I hope to read a little every day and pick up a little along the way.

The core scripts are spark-shell, spark-submit, spark-class, load-spark-env, and find-spark-home, all located under spark/bin in the source tree.

spark-shell

Main functions:

  1. Detect the operating system and enter POSIX mode
  2. Add the flag that makes the Scala REPL load the Java classpath
# Shell script for starting the Spark Shell REPL
cygwin=false    # default: not running under Cygwin (i.e., a Linux/Unix system)
case "$(uname)" in  # check the OS name; set to true if it is Cygwin
  CYGWIN*) cygwin=true;;
esac

# Enter posix mode for bash (mainly for portability)
set -o posix

# Set SPARK_HOME if it is not already set
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"

# SPARK-4161: scala does not assume use of the java classpath,
# so we need to add the "-Dscala.usejavacp=true" flag manually. We
# do this specifically for the Spark shell because the scala REPL
# has its own class loader, and any additional classpath specified
# through spark.driver.extraClassPath is not automatically propagated.
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"

function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else # on Linux/Unix: just export the options and run the REPL main class
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

# Copy restore-TTY-on-exit functions from Scala script so spark-shell exits properly even in
# binary distribution of Spark where Scala is not installed
exit_status=127
saved_stty=""

# restore stty settings (echo in particular)
function restoreSttySettings() {
  stty $saved_stty
  saved_stty=""
}

function onExit() {
  if [[ "$saved_stty" != "" ]]; then
    restoreSttySettings
  fi
  exit $exit_status
}

# to reenable echo if we are interrupted before completing.
trap onExit INT

# save terminal settings
saved_stty=$(stty -g 2>/dev/null)
# clear on error so we don't later try to restore them
if [[ ! $? ]]; then
  saved_stty=""
fi

main "$@"

# record the exit status lest it be overwritten:
# then reenable echo and propagate the code.
exit_status=$?
onExit
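
To make the flow concrete, here is a minimal sketch (the --master value is only an example, not taken from the script) of what main() ends up running on a Linux machine:

# A minimal sketch: ./bin/spark-shell --master local[2] effectively becomes
export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
"${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main \
  --name "Spark shell" --master local[2]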

spark-submit

Function:

Loads SparkSubmit through spark-class.

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
# (set the hash seed to 0 so string hashing is deterministic)
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
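
For instance (the application class and jar below are hypothetical, only to illustrate the exec), a submit call is rewritten into a spark-class call:

# A minimal sketch with a hypothetical application:
#   ./bin/spark-submit --master local[2] --class com.example.WordCount wordcount.jar
# is exec'd as
"${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit \
  --master local[2] --class com.example.WordCount wordcount.jar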

spark-class

Functions:

  1. Locate the java binary
  2. Locate and load the Spark jars
  3. Build the final launch command via the launcher library
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Load the Spark environment via load-spark-env.sh
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
# Set the Spark jars directory; if it does not exist, exit. Otherwise put every
# jar in it on the launch classpath.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests: when SPARK_TESTING is set, unset the Hadoop/YARN config dirs
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
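
The NUL-delimited protocol is easy to try on its own. Below is a minimal, self-contained sketch; fake_launcher is a made-up stand-in for org.apache.spark.launcher.Main and is not part of Spark:

#!/usr/bin/env bash
# Stand-in launcher: prints NUL-separated arguments, then its exit status,
# mirroring what build_command produces above.
fake_launcher() {
  printf '%s\0' java -cp "/path with spaces/app.jar" com.example.Main
  printf '%d\0' 0
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(fake_launcher)

LAST=$(( ${#CMD[@]} - 1 ))
echo "launcher exit code: ${CMD[$LAST]}"   # -> 0
CMD=("${CMD[@]:0:$LAST}")
echo "command to exec: ${CMD[*]}"          # arguments with spaces survive intact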

load-spark-env

Functions:

  1. Figure out where Spark is installed
  2. Load environment variables from spark-env.sh
  3. Select the Scala version
# Figure out where Spark is installed
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1
  export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}"/conf}"

  if [ -f "${SPARK_CONF_DIR}/spark-env.sh" ]; then
    # Promote all variable declarations to environment (exported) variables
    set -a
    . "${SPARK_CONF_DIR}/spark-env.sh"
    set +a
  fi
fi

# Setting SPARK_SCALA_VERSION if not already set.
if [ -z "$SPARK_SCALA_VERSION" ]; then
  ASSEMBLY_DIR2="${SPARK_HOME}/assembly/target/scala-2.11"
  ASSEMBLY_DIR1="${SPARK_HOME}/assembly/target/scala-2.12"

  if [[ -d "$ASSEMBLY_DIR2" && -d "$ASSEMBLY_DIR1" ]]; then
    echo -e "Presence of build for multiple Scala versions detected." 1>&2
    echo -e 'Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.' 1>&2
    exit 1
  fi

  if [ -d "$ASSEMBLY_DIR2" ]; then
    export SPARK_SCALA_VERSION="2.11"
  else
    export SPARK_SCALA_VERSION="2.12"
  fi
fi
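
A quick way to see the set -a trick in action (the file path and variable names below are only examples): every plain assignment sourced between set -a and set +a ends up exported to child processes without an explicit export.

# A minimal sketch of the set -a mechanism (example variables only)
cat > /tmp/spark-env-demo.sh <<'EOF'
SPARK_WORKER_MEMORY=2g      # plain assignments, no explicit `export`
SPARK_LOCAL_IP=127.0.0.1
EOF

set -a
. /tmp/spark-env-demo.sh
set +a

# Both variables are now visible to child processes:
bash -c 'echo "$SPARK_WORKER_MEMORY $SPARK_LOCAL_IP"'   # -> 2g 127.0.0.1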

find-spark-home

Functions:

  1. Determine SPARK_HOME
  2. Handle the pip-installed PySpark case
# Attempts to find a proper value for SPARK_HOME. Should be included using "source" directive.
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"

# Short circuit if the user already has this set.
if [ ! -z "${SPARK_HOME}" ]; then
  exit 0
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
  # need to search the different Python directories for a Spark installation.
  # Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
  # spark-submit in another directory we want to use that version of PySpark rather than the
  # pip installed version of PySpark.
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
  # Default to standard python interpreter unless told otherwise
  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
    PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
  fi
  export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
fi
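
One way to poke at this resolution on your own machine is a small wrapper placed next to the real scripts (a hedged sketch: the helper's name is made up and it is not part of Spark):

#!/usr/bin/env bash
# bin/where-is-spark.sh -- hypothetical helper, saved next to find-spark-home
unset SPARK_HOME                      # pretend the user has not set it yet
source "$(dirname "$0")"/find-spark-home
echo "resolved SPARK_HOME=${SPARK_HOME}"

Running bash bin/where-is-spark.sh from a Spark distribution prints the distribution root; under a pip install it prints whatever find_spark_home.py computes.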

 
