1. Spark Source Code Walkthrough: Startup Scripts

Reading the Spark source code from scratch; the early posts will be fairly detailed.

There are two main ways to start a Spark standalone cluster: start-all.sh starts everything in one step, while start-master.sh and start-slave.sh start the master and a slave (worker) individually.

There are likewise two ways to run Spark: spark-shell and spark-submit.

This post walks through all of the related startup scripts.
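As a quick reference before diving into the scripts, these are the typical invocations; paths are relative to $SPARK_HOME and the master URL is a placeholder:

# One-step cluster start (standalone mode)
sbin/start-all.sh

# Start the master and a worker separately
sbin/start-master.sh
sbin/start-slave.sh spark://<master-host>:7077

# Run Spark interactively, or submit an application
bin/spark-shell --master spark://<master-host>:7077
bin/spark-submit --master spark://<master-host>:7077 --class <main-class> <app.jar>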


Spark cluster startup scripts

start-all.sh

This script mainly runs spark-config.sh, start-master.sh and start-slaves.sh.

In other words, it loads the conf directory, starts the master node, and starts the worker nodes.

Because Spark can be started in different ways through different scripts, every script re-imports the SPARK_HOME and conf directory environment variables for itself.

# Export the Spark installation directory
if [ -z "${SPARK_HOME}" ]; then    # SPARK_HOME is empty, so run the export below
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"    # cd to the script's parent directory and export it as SPARK_HOME
# "$0" is the path of the current script
# `dirname "$0"` is the directory containing the script
# $(cd "`dirname "$0"`"/..; pwd) changes into the script's parent directory and prints its path, i.e. the Spark root directory
fi

# Load the Spark configuration
. "${SPARK_HOME}/sbin/spark-config.sh"    # source the Spark configuration script

# Start Master
"${SPARK_HOME}/sbin"/start-master.sh    # script that starts the master

# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh    # script that starts the workers

spark-config.sh

This script is responsible for exporting SPARK_CONF_DIR (the conf directory) and adding the PySpark classes to PYTHONPATH.

# symlink and absolute path should rely on SPARK_HOME to resolve
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# Export the Spark conf directory; if SPARK_CONF_DIR is unset or empty, fall back to ${SPARK_HOME}/conf
export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}/conf"}"

# Add the PySpark classes to the PYTHONPATH:
if [ -z "${PYSPARK_PYTHONPATH_SET}" ]; then    # only if the PYTHONPATH has not been set up yet
  export PYTHONPATH="${SPARK_HOME}/python:${PYTHONPATH}"    # prepend the PySpark sources to PYTHONPATH
  export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${PYTHONPATH}"
  export PYSPARK_PYTHONPATH_SET=1
fi
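The `${VAR:-default}` expansion and the PYSPARK_PYTHONPATH_SET guard are plain Bash idioms; here is a minimal standalone sketch of both, using hypothetical variable names:

unset CONF_DIR GUARD
echo "${CONF_DIR:-/opt/spark/conf}"    # CONF_DIR is unset, so this prints /opt/spark/conf
CONF_DIR=/etc/spark
echo "${CONF_DIR:-/opt/spark/conf}"    # now prints /etc/spark

if [ -z "${GUARD}" ]; then             # body runs only the first time the file is sourced
  export GUARD=1
  echo "one-time setup"
fi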

start-master.sh

This script sets the Master class, determines the master's host, port and web UI port, sources load-spark-env.sh, and finally launches the master through spark-daemon.sh.

# Starts the master on the machine this script is executed on.

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# NOTE: This exact class name is matched downstream by SparkSubmit.
# Any changes need to be reflected there.
# Set CLASS to the Master class
CLASS="org.apache.spark.deploy.master.Master"

# Check the arguments and print help if requested.
# If the script was invoked as start-master.sh --help or -h, print usage and exit.
# (When called from start-all.sh no arguments are passed, so this branch is skipped.)
if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  echo "Usage: ./sbin/start-master.sh [options]"
  pattern="Usage:"
  pattern+="\|Using Spark's default log4j profile:"    # +=添加,\|表示或
  pattern+="\|Registered signal handlers for"

  # 打印帮助信息
  # 加载spark-class,再执行launch/main,见下面spark-class中build_command方法
  # 传入参数.master类,调用--help方法打印帮助信息
  # 将错误信息重定向到标准输出,过滤含有pattern的字符串
  # 完整:spark-class org.apache.spark.deploy.master.Master --help
  "${SPARK_HOME}"/bin/spark-class $CLASS --help 2>&1 | grep -v "$pattern" 1>&2
  exit 1
fi

# 将调用start-master的参数列表赋值给ORIGINAL_ARGS
# 从start-all.sh传过的,来没有参数
ORIGINAL_ARGS="$@"

. "${SPARK_HOME}/sbin/spark-config.sh"    # 同样加载conf目录为环境变量

. "${SPARK_HOME}/bin/load-spark-env.sh"    # 启动加载spark-env的脚本

if [ "$SPARK_MASTER_PORT" = "" ]; then    # 如果master端口为空,设置默认为7077
  SPARK_MASTER_PORT=7077
fi

# 设置master的host,即当前脚本运行主机名
if [ "$SPARK_MASTER_HOST" = "" ]; then    # 如果master的host为空
  case `uname` in        # 匹配hostname,lunix下查看hostname命令为uname
      (SunOS)            # 如果hostname为SunOs,设置host为查看hostname的最后一个字段
	  SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
	  ;;    # 匹配中断
      (*)    # 如果hostname为其他,设置为hostname -f查看的结果
	  SPARK_MASTER_HOST="`hostname -f`"
	  ;;
  esac    #匹配结束
fi

# 如果webUI端口为空,设置默认为8080
if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then
  SPARK_MASTER_WEBUI_PORT=8080
fi

# 启动spark-daemon脚本,参数为:start、$CLASS、1、host、port、webUI-port、$ORIGINAL_ARGS
# 直译为:
# sbin/spark-daemon.sh start org.apache.spark.deploy.master.Master 1
# --host hostname --port 7077 --webui-port 8080
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \    
  --host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
  $ORIGINAL_ARGS
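Because the defaults (7077, 8080, `hostname -f`) only apply when the corresponding variables are unset, exporting them before the call, for example in spark-env.sh or on the command line, changes the launch. A hypothetical host name and port:

SPARK_MASTER_HOST=master01 SPARK_MASTER_PORT=7077 SPARK_MASTER_WEBUI_PORT=9090 \
  sbin/start-master.sh

Any extra command-line options are forwarded untouched through ORIGINAL_ARGS to the Master class.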

spark-class

This script locates the Java binary and the Spark jars directory, calls org.apache.spark.launcher.Main to build the final command, and then executes that command. In the end it is what actually runs the requested class.

Almost every Spark service eventually goes through spark-class to run its class.

# Resolve SPARK_HOME
# "source" runs a script in the current shell so its exports take effect immediately, e.g. source /etc/profile
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Load the settings from spark-env.sh
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
# Resolve the path to java and store it in RUNNER
if [ -n "${JAVA_HOME}" ]; then         # if JAVA_HOME is set, use the java under it
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then    # otherwise check whether java is on the PATH
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2    # neither found: print an error and exit with failure
    exit 1
  fi
fi

# Find Spark jars.
# Locate the Spark jars directory
if [ -d "${SPARK_HOME}/jars" ]; then    # a release layout keeps the jars directly under $SPARK_HOME/jars
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

# Define the launcher classpath LAUNCH_CLASSPATH.
# If the jars directory does not exist and the testing variables are unset, fail with an error;
# otherwise the launcher classpath is everything under the jars directory.
if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
# In a development build, prepend the launcher's compiled classes to the classpath
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
 LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
# In test mode, drop the YARN and Hadoop configuration directories
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.

# Run org.apache.spark.launcher.Main on the launcher classpath,
# passing it the class to run plus its arguments; it prints back the resolved command
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?        # append the launcher's exit code, terminated by a NUL byte (\0)
}

# Turn off posix mode since it does not allow process substitution

# Disable posix mode (posix mode forbids process substitution) and create an empty array CMD
set +o posix
CMD=()

# Call build_command, i.e. org.apache.spark.launcher.Main, with the class to run;
# read its NUL-separated output in a while loop, appending each argument to the CMD array
while IFS= read -d '' -r ARG; do    # -d '' makes NUL the delimiter, -r disables backslash escapes
  CMD+=("$ARG")
done < <(build_command "$@")

# ${#CMD[@]} is the number of elements in CMD (${arr[@]} expands to all elements, ${#arr[@]} to the length)
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))     # index of the last element, which is the exit code appended above
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.

# Check whether the exit code is an integer; if not, print everything except the last line and exit with an error
# ^ anchors the start, [0-9] matches a digit, + means one or more, $ anchors the end
# head -n -1 prints all lines except the last one
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

# If the launcher's exit code is non-zero, exit with the same code
if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

# Execute the java -cp command stored in CMD
# ${CMD[@]:0:$LAST} keeps elements 0..LAST-1, i.e. drops the appended exit code
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

find-spark-home

This helper resolves SPARK_HOME: even if the user never set it, the script works it out and exports it.

# Attempts to find a proper value for SPARK_HOME. Should be included using "source" directive.

# Path of the find_spark_home.py script that sits next to this one
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"

# Short circuit if the user already has this set.
# If SPARK_HOME is already set, exit this script successfully
if [ ! -z "${SPARK_HOME}" ]; then
   exit 0
# SPARK_HOME is not set: if find_spark_home.py is NOT next to this script, this is not a pip install,
# so derive SPARK_HOME from the script's location below
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
  # need to search the different Python directories for a Spark installation.
  # Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
  # spark-submit in another directory we want to use that version of PySpark rather than the
  # pip installed version of PySpark.
  # Go to the script's parent directory (the Spark directory) and export it as SPARK_HOME
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
  # Default to standard python interpreter unless told otherwise
  # find_spark_home.py exists (pip install): pick a Python interpreter if none is configured
  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
     PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"    # fall back to PYSPARK_PYTHON or plain "python"
  fi
  # let find_spark_home.py resolve SPARK_HOME
  export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
fi

load-spark-env.sh

This script exports the settings from spark-env.sh as environment variables and determines the Scala version.

spark-env.sh is where the user's initial configuration lives.

# This script loads spark-env.sh if it exists, and ensures it is only loaded once.
# spark-env.sh is loaded from SPARK_CONF_DIR if set, or within the current directory's
# conf/ subdirectory.

# Figure out where Spark is installed
# Resolve SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Export the spark-env.sh configuration as environment variables
if [ -z "$SPARK_ENV_LOADED" ]; then    # only if spark-env.sh has not been loaded yet
  export SPARK_ENV_LOADED=1

  # Export the Spark conf directory
  export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}"/conf}"

  # If spark-env.sh exists in the conf directory
  if [ -f "${SPARK_CONF_DIR}/spark-env.sh" ]; then
    # Promote all variable declarations to environment (exported) variables
    set -a
    . "${SPARK_CONF_DIR}/spark-env.sh"    # every variable assigned in spark-env.sh becomes an environment variable
    set +a
  fi
fi

# Setting SPARK_SCALA_VERSION if not already set.
# Determine the Scala version
if [ -z "$SPARK_SCALA_VERSION" ]; then

  ASSEMBLY_DIR2="${SPARK_HOME}/assembly/target/scala-2.11"    # the two Scala versions a build may exist for
  ASSEMBLY_DIR1="${SPARK_HOME}/assembly/target/scala-2.12"

  if [[ -d "$ASSEMBLY_DIR2" && -d "$ASSEMBLY_DIR1" ]]; then    # if both directories exist
    # Builds for multiple Scala versions were detected: print an error (1>&2 redirects stdout to stderr)
    echo -e "Presence of build for multiple Scala versions detected." 1>&2
    # Ask the user to either remove one of them or set SPARK_SCALA_VERSION in spark-env.sh
    echo -e 'Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.' 1>&2
    exit 1
  fi

  # Only one version is present: use 2.11 if its directory exists, otherwise 2.12
  if [ -d "$ASSEMBLY_DIR2" ]; then
    export SPARK_SCALA_VERSION="2.11"
  else
    export SPARK_SCALA_VERSION="2.12"
  fi
fi
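The `set -a` / `set +a` pair is what turns plain assignments in spark-env.sh into exported environment variables. A minimal sketch, using a hypothetical file name:

cat > /tmp/demo-env.sh <<'EOF'
MY_SETTING=hello                           # note: no export keyword
EOF

set -a                                     # auto-export every variable assigned from now on
. /tmp/demo-env.sh
set +a                                     # back to normal behaviour

bash -c 'echo "child sees: $MY_SETTING"'   # prints "child sees: hello"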

spark-env.sh 

This file holds the user-defined configuration: directories, ports, memory settings and so on.

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)

###
### === IMPORTANT ===
### Change the following to specify a real cluster's Master host
###
# The master's hostname for standalone mode
export STANDALONE_SPARK_MASTER_HOST=`hostname`
# The master's address
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

### Let's run everything with JVM runtime, instead of Scala
# Export the remaining settings: disable the Scala launcher, the Spark lib directory,
# master and worker ports and web UI ports, the worker work directory, the log directory and the pid directory
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR='/var/run/spark/'

# Native Hadoop libraries
if [ -n "$HADOOP_HOME" ]; then
  export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/libfakeroot:/usr/lib64/libfakeroot:/usr/lib32/libfakeroot:/usr/lib/hadoop/lib/native
fi

# Hadoop conf directory
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

# Append the various dependency directories to Spark's classpath
if [[ -d $SPARK_HOME/python ]]
then
    for i in 
    do
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$i
    done
fi

SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_LIBRARY_PATH/spark-assembly.jar"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hive/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/flume-ng/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/parquet/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/avro/*"
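The spark-env.sh shown above is a heavily customized, distribution-style example. For a plain standalone cluster a much smaller hand-written version is typical; a sketch using only options documented in the comment block above, with hypothetical values:

export SPARK_MASTER_HOST=master01
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=1
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf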

spark-daemon.sh

This script parses its arguments, initializes the environment (conf directory, log directory, pid directory, scheduling priority), and launches the requested daemon, here the Master process.

Its main pieces are: log rotation, a case statement over the command (submit, start, stop, status), run_command, and execute_command.

Command handling: submit and start go through run_command, which in turn calls spark-class or spark-submit; stop and status are handled directly.

Arguments:

As passed from start-master.sh: start org.apache.spark.deploy.master.Master 1 --host hostname --port 7077 --webui-port 8080

# Runs a Spark command as a daemon.
#
# Environment Variables
#
# All of these are given values further down:
#   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_HOME}/conf.
#   SPARK_LOG_DIR   Where log files are stored. ${SPARK_HOME}/logs by default.
#   SPARK_MASTER    host:path where spark code should be rsync'd from
#   SPARK_PID_DIR   The pid files are stored. /tmp by default.
#   SPARK_IDENT_STRING   A string representing this instance of spark. $USER by default
#   SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
#   SPARK_NO_DAEMONIZE   If set, will run the proposed command in the foreground. It will not output a PID file.
##

# Usage string for spark-daemon.sh
usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>"

# if no args specified, show usage
# With one argument or fewer, print the usage string and exit
if [ $# -le 1 ]; then
  echo $usage
  exit 1
fi

# Export SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# Load the conf directory and the PySpark paths
. "${SPARK_HOME}/sbin/spark-config.sh"    # exports SPARK_CONF_DIR and the PySpark PYTHONPATH entries

# get arguments
# Check if --config is passed as an argument. It is an optional parameter.
# Exit if the argument is not a directory.
# A conf directory passed on the command line takes precedence over the default;
# --config is optional, and its value must be a directory.
# (start-master.sh does not pass --config, so this block is skipped in that case.)
if [ "$1" == "--config" ]  # the first argument is --config
then
  shift                    # shift without a count means 1: the argument list moves left by one and the old $1 is dropped
  conf_dir="$1"            # after the shift, $1 is the directory that followed --config
  if [ ! -d "$conf_dir" ]  # if conf_dir is not a directory, print an error and exit
  then
    echo "ERROR : $conf_dir is not a directory"
    echo $usage
    exit 1
  else
    export SPARK_CONF_DIR="$conf_dir"    # it is a directory: export it as the conf directory
  fi
  shift            # two shifts in total for this block
fi

# When called from start-master.sh the arguments are:
# start org.apache.spark.deploy.master.Master 1 --host hostname --port 7077 --webui-port 8080
option=$1          # $1 is the command: start/stop/submit/status, here "start"
shift
command=$1         # after one shift, $1 is org.apache.spark.deploy.master.Master
shift
instance=$1        # after two shifts, $1 is the instance number, 1 by default
shift              # after three shifts, the remaining arguments are:
                   # --host hostname --port 7077 --webui-port 8080

# Log rotation
spark_rotate_log ()
{
    log=$1;       # the log file, defined and passed in further down
    num=5;        # keep at most 5 rotated logs by default
    if [ -n "$2" ]; then    # if a second argument was given, use it as the count
	num=$2
    fi
    if [ -f "$log" ]; then # rotate logs       # only if the log file exists
	while [ $num -gt 1 ]; do               # loop while num > 1
	    prev=`expr $num - 1`               # prev = num - 1
        # if log.prev exists, rename it to log.num (log.4 -> log.5, ..., log.1 -> log.2)
	    [ -f "$log.$prev" ] && mv "$log.$prev" "$log.$num"
	    num=$prev             # move down one slot
	done
	mv "$log" "$log.$num";    # finally rename the current log to log.1
    fi
}

# Load the configuration from spark-env.sh
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Spark instance name; if empty, default to the current user name
if [ "$SPARK_IDENT_STRING" = "" ]; then
  export SPARK_IDENT_STRING="$USER"
fi

# Make the launcher print the command it is about to run
export SPARK_PRINT_LAUNCH_COMMAND="1"

# get log directory
# If the log directory is empty, default to ${SPARK_HOME}/logs
if [ "$SPARK_LOG_DIR" = "" ]; then
  export SPARK_LOG_DIR="${SPARK_HOME}/logs"
fi
mkdir -p "$SPARK_LOG_DIR"    # create the log directory
touch "$SPARK_LOG_DIR"/.spark_test > /dev/null 2>&1    # try to create a test file, discarding any output
TEST_LOG_DIR=$?    # exit status of the touch
if [ "${TEST_LOG_DIR}" = "0" ]; then    # status 0 means the directory is writable: remove the test file
  rm -f "$SPARK_LOG_DIR"/.spark_test
else
  chown "$SPARK_IDENT_STRING" "$SPARK_LOG_DIR"    # otherwise change the owner of the log directory to $SPARK_IDENT_STRING ($USER)
fi

# Default pid directory
if [ "$SPARK_PID_DIR" = "" ]; then
  SPARK_PID_DIR=/tmp
fi

# some variables
# Log file and pid file for this daemon
log="$SPARK_LOG_DIR/spark-$SPARK_IDENT_STRING-$command-$instance-$HOSTNAME.out"
pid="$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid"

# Set default scheduling priority
# Scheduling priority for the daemon, 0 by default
if [ "$SPARK_NICENESS" = "" ]; then
    export SPARK_NICENESS=0
fi

# Function that actually launches the Spark process.
# It runs the given command in the foreground or background, records the pid,
# and checks whether the process actually came up.
execute_command() {
  # Decide between foreground and background execution
  if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then       # SPARK_NO_DAEMONIZE is unset: run in the background
      nohup -- "$@" >> $log 2>&1 < /dev/null &    # run detached, appending stdout/stderr to the log, with /dev/null as stdin
      newpid="$!"    # pid of the background process

      echo "$newpid" > "$pid"    # write the pid into the pid file defined above

      # Poll for up to 5 seconds for the java process to start
      # Check every 0.5 seconds, up to 10 times, whether the java process has started
      for i in {1..10}
      do
        # ps -p selects the process by pid; -o comm= prints only the command name, with no header;
        # =~ is a regex match (variables and literals may be quoted)
        if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
           break     # it is up, stop polling
        fi
        sleep 0.5    # not yet: sleep half a second and check again
      done

      sleep 2    # wait another 2 seconds, then check whether the process is still alive; if not, show the log
      # Check if the process has died; in that case we'll tail the log so the user can see
      if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]; then    # no java process with that pid
        echo "failed to launch: $@"        # report the failure
        tail -10 "$log" | sed 's/^/  /'    # show the last 10 log lines, indented by two spaces
        echo "full log in $log"            # point the user at the full log
      fi
  else
      "$@"    # SPARK_NO_DAEMONIZE is set: run the command in the foreground
  fi
}

# Runs the submit/class command.
# After the dispatch below, the call looks like:
# run_command submit|class --host hostname --port 7077 --webui-port 8080
run_command() {
  mode="$1"	# "submit" or "class"
  shift    	# shift the argument list left by one

  mkdir -p "$SPARK_PID_DIR"    	# create the pid directory (default defined above)

  # Use the pid file to check whether the daemon is already running
  if [ -f "$pid" ]; then	        # the pid file exists
    TARGET_ID="$(cat "$pid")"	    # read the pid from it
    if [[ $(ps -p "$TARGET_ID" -o comm=) =~ "java" ]]; then            	# and the process is still alive
      # already running: since this is a submit/start, ask the user to stop it first
      echo "$command running as process $TARGET_ID.  Stop it first."
      exit 1	# exit with an error code
    fi
  fi

  # If SPARK_MASTER is set, rsync the Spark directory from that host before starting
  if [ "$SPARK_MASTER" != "" ]; then
    echo rsync from "$SPARK_MASTER"
    rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' "$SPARK_MASTER/" "${SPARK_HOME}"
  fi

  # Rotate the log file, then announce where the output goes
  spark_rotate_log "$log"
  echo "starting $command, logging to $log"

  case "$mode" in

	# nice -n runs the command at the given priority; $SPARK_NICENESS defaults to 0 above
    # "class" corresponds to the start command.
    # Expanded:
    # execute_command nice -n 0 bin/spark-class org.apache.spark.deploy.master.Master
    #   --host hostname --port 7077 --webui-port 8080
    (class)
      execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class "$command" "$@"
      ;;

    # Expanded:
    # execute_command nice -n 0 bash bin/spark-submit --class <class>
    #   --host hostname --port 7077 --webui-port 8080
    (submit)
      execute_command nice -n "$SPARK_NICENESS" bash "${SPARK_HOME}"/bin/spark-submit --class "$command" "$@"
      ;;

    (*)	# anything else is an error
      echo "unknown mode: $mode"
      exit 1
      ;;
  esac

}

# Dispatch on the command: submit/start/stop/status
case $option in

	# submit: run run_command in submit mode
  (submit)
    run_command submit "$@"
    ;;

	# start: run run_command in class mode
  (start)
    run_command class "$@"
    ;;

  (stop)	# stop

    # Use the pid file to find the process: kill it if it is alive, otherwise report there is nothing to stop
    if [ -f $pid ]; then	        # the pid file exists
      TARGET_ID="$(cat "$pid")"    	# read the pid
      if [[ $(ps -p "$TARGET_ID" -o comm=) =~ "java" ]]; then	# the java process is alive
        echo "stopping $command"	# report that we are stopping it
        kill "$TARGET_ID" && rm -f "$pid"	# kill the process and remove the pid file
      else
        echo "no $command to stop"	# pid file exists but the process is gone
      fi
    else
      echo "no $command to stop"	# no pid file at all
    fi
    ;;

  (status)	# status

    if [ -f $pid ]; then	# pid file exists: if the process is alive, report it and exit 0
      TARGET_ID="$(cat "$pid")"
      if [[ $(ps -p "$TARGET_ID" -o comm=) =~ "java" ]]; then
        echo $command is running.
        exit 0
      else
        echo $pid file is present but $command not running	# pid file exists but no process: exit 1
        exit 1
      fi
    else	# no pid file: the daemon is not running, exit 2
      echo $command not running.
      exit 2
    fi
    ;;

  (*)	# anything else: print the usage string
    echo $usage
    exit 1
    ;;

esac
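The same daemon script can be driven by hand, which is handy for checking or stopping a single daemon; for example, using the class and instance number seen above:

# Is the standalone Master (instance 1) running?
sbin/spark-daemon.sh status org.apache.spark.deploy.master.Master 1

# Stop it; the pid file under $SPARK_PID_DIR is used to find the process
sbin/spark-daemon.sh stop org.apache.spark.deploy.master.Master 1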

 

start-slaves.sh

This script determines the master's host and port and then invokes slaves.sh to launch a worker on every slave host.

# Starts a slave instance on each machine specified in the conf/slaves file.

# As before: resolve SPARK_HOME, load the conf directory and spark-env.sh, then determine the master's host and port
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Find the port number for the master
if [ "$SPARK_MASTER_PORT" = "" ]; then
  SPARK_MASTER_PORT=7077
fi

if [ "$SPARK_MASTER_HOST" = "" ]; then
  case `uname` in
      (SunOS)
	  SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
	  ;;
      (*)
	  SPARK_MASTER_HOST="`hostname -f`"
	  ;;
  esac
fi

# Launch the slaves
# Run slaves.sh; the command it will execute on every slave is:
# cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
"${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"

slaves.sh

This script loads the list of slave hosts and uses passwordless SSH to log in to each of them and run start-slave.sh.

# Run a shell command on all slave hosts.
# Runs the given command on every slave host
#
# Environment Variables
#
# The host list defaults to conf/slaves:
#   SPARK_SLAVES    File naming remote hosts.
#     Default is ${SPARK_CONF_DIR}/slaves.
#   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_HOME}/conf.
#   SPARK_SLAVE_SLEEP Seconds to sleep between spawning remote commands.
#   SPARK_SSH_OPTS Options passed to ssh when running remote commands.
##

# Usage string for slaves.sh
usage="Usage: slaves.sh [--config <conf-dir>] command..."

# if no args specified, show usage
# With no arguments, print the usage string and exit with code 1
if [ $# -le 0 ]; then
  echo $usage
  exit 1
fi

# Export SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# Load the conf directory and the PySpark paths
. "${SPARK_HOME}/sbin/spark-config.sh"

# If the slaves file is specified in the command line,
# then it takes precedence over the definition in
# spark-env.sh. Save it here.
# If SPARK_SLAVES points at a file, read the host list from it now
if [ -f "$SPARK_SLAVES" ]; then
  HOSTLIST=`cat "$SPARK_SLAVES"`
fi

# Check if --config is passed as an argument. It is an optional parameter.
# Exit if the argument is not a directory.
# Handle the optional --config argument, exactly as in spark-daemon.sh:
# if it is present, shift, check that its value is a directory (error out otherwise),
# export it as SPARK_CONF_DIR, and shift again so only the command remains
if [ "$1" == "--config" ]
then
  shift
  conf_dir="$1"
  if [ ! -d "$conf_dir" ]
  then
    echo "ERROR : $conf_dir is not a directory"
    echo $usage
    exit 1
  else
    export SPARK_CONF_DIR="$conf_dir"
  fi
  shift
fi

# Load spark-env.sh settings and the Scala version
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Build the host list
# If HOSTLIST is still empty: with SPARK_SLAVES also empty, read conf/slaves if it exists,
# otherwise fall back to localhost
if [ "$HOSTLIST" = "" ]; then
  if [ "$SPARK_SLAVES" = "" ]; then
    if [ -f "${SPARK_CONF_DIR}/slaves" ]; then
      HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"`
    else
      HOSTLIST=localhost
    fi
  # SPARK_SLAVES is set: read the host list from that file
  else
    HOSTLIST=`cat "${SPARK_SLAVES}"`
  fi
fi

# By default disable strict host key checking
# If no SSH options are configured, disable strict host key checking
if [ "$SPARK_SSH_OPTS" = "" ]; then
  SPARK_SSH_OPTS="-o StrictHostKeyChecking=no"
fi

# SSH to every host and run the command that was passed in, i.e.:
# cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
# The sed expression strips comments and blank lines from the host list,
# so the loop iterates over the actual slave hostnames.
for slave in `echo "$HOSTLIST"|sed  "s/#.*$//;/^$/d"`; do
  if [ -n "${SPARK_SSH_FOREGROUND}" ]; then    # if SPARK_SSH_FOREGROUND is set, run ssh in the foreground
    # ${@// /\\ } escapes every space in the arguments so they survive the remote shell;
    # the arguments are the ones forwarded from start-slaves.sh, i.e.
    # cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh"
    # "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
    ssh $SPARK_SSH_OPTS "$slave" $"${@// /\\ }" \
      2>&1 | sed "s/^/$slave: /"    # prefix every output line with "<slave>: "
  else
    ssh $SPARK_SSH_OPTS "$slave" $"${@// /\\ }" \
      2>&1 | sed "s/^/$slave: /" &    # same, but in the background
  fi

  # Optionally sleep between spawning remote commands
  if [ "$SPARK_SLAVE_SLEEP" != "" ]; then
    sleep $SPARK_SLAVE_SLEEP
  fi
done

wait
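The host list read above is just one hostname per line, with `#` comments and blank lines removed by the sed filter. A hypothetical conf/slaves:

# conf/slaves -- one worker host per line
worker01
worker02
# worker03 is temporarily out of the cluster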

start-slave.sh

Starts a worker on the local machine: it works out the worker's ports and then calls spark-daemon.sh.

# Starts a slave on the machine this script is executed on.
#
# Environment Variables
#
#   # number of worker instances, 1 by default:
#   SPARK_WORKER_INSTANCES  The number of worker instances to run on this
#                           slave.  Default is 1.
#   SPARK_WORKER_PORT       The base port number for the first worker. If set,
#                           subsequent workers will increment this number.  If
#                           unset, Spark will find a valid port number, but
#                           with no guarantee of a predictable pattern.
#   SPARK_WORKER_WEBUI_PORT The base port for the web interface of the first
#                           worker.  Subsequent workers will increment this
#                           number.  Default is 8081.

# Export SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# The Worker class
# NOTE: This exact class name is matched downstream by SparkSubmit.
# Any changes need to be reflected there.
CLASS="org.apache.spark.deploy.worker.Worker"

# With fewer than one argument, or with --help / -h, print the usage text
if [[ $# -lt 1 ]] || [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  echo "Usage: ./sbin/start-slave.sh [options] <master>"
  pattern="Usage:"
  pattern+="\|Using Spark's default log4j profile:"
  pattern+="\|Registered signal handlers for"

  # As in start-master.sh: spark-class runs the launcher with the Worker class and --help, then exit 1
  "${SPARK_HOME}"/bin/spark-class $CLASS --help 2>&1 | grep -v "$pattern" 1>&2
  exit 1
fi

# Load the conf directory and the other environment variables, same as for the master
. "${SPARK_HOME}/sbin/spark-config.sh"

. "${SPARK_HOME}/bin/load-spark-env.sh"

# First argument should be the master; we need to store it aside because we may
# need to insert arguments between it and the other arguments
# MASTER becomes $1, i.e. spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT
# It is the only argument passed in, so after the shift "$@" is empty
MASTER=$1
shift

# Worker web UI port, 8081 by default
# Determine desired worker port
if [ "$SPARK_WORKER_WEBUI_PORT" = "" ]; then
  SPARK_WORKER_WEBUI_PORT=8081
fi

# Start up the appropriate number of workers on this machine.
# quick local function to start a worker
function start_instance {
  WORKER_NUM=$1    # the instance number passed in below (1, 2, ...)
  shift            # drop it; the rest of the arguments are forwarded

  # If SPARK_WORKER_PORT is empty, pass no --port flag to spark-daemon.sh
  if [ "$SPARK_WORKER_PORT" = "" ]; then
    PORT_FLAG=
    PORT_NUM=
  else
    PORT_FLAG="--port"
    PORT_NUM=$(( $SPARK_WORKER_PORT + $WORKER_NUM - 1 ))
  fi
  # Each worker gets its own web UI port, counting up from 8081
  WEBUI_PORT=$(( $SPARK_WORKER_WEBUI_PORT + $WORKER_NUM - 1 ))

  # Expanded: spark-daemon.sh start org.apache.spark.deploy.worker.Worker <N> --webui-port <port> [--port <port>] <master-url>
  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
     --webui-port "$WEBUI_PORT" $PORT_FLAG $PORT_NUM $MASTER "$@"
}

# Start the worker instances via start_instance.
# "$@" is empty here because of the shift above, so effectively only the instance number is passed.
if [ "$SPARK_WORKER_INSTANCES" = "" ]; then
  start_instance 1 "$@"    # default: a single worker
else
  # SPARK_WORKER_INSTANCES is set: start that many workers, numbered 1..N
  for ((i=0; i<$SPARK_WORKER_INSTANCES; i++)); do
    start_instance $(( 1 + $i )) "$@"
  done
fi
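Run on a worker machine directly, the script only needs the master URL; the number of workers per host is controlled through the environment. With a hypothetical master host:

# Single worker, web UI on 8081
sbin/start-slave.sh spark://master01:7077

# Two workers on this host: ports 7078/7079, web UIs on 8081/8082
SPARK_WORKER_INSTANCES=2 SPARK_WORKER_PORT=7078 sbin/start-slave.sh spark://master01:7077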

spark-shell

Starts Spark's interactive shell. It detects the platform, then calls spark-submit with --class org.apache.spark.repl.Main and the application name "Spark shell".

# Shell script for starting the Spark Shell REPL

# Detect whether we are running on Windows under Cygwin; defaults to false
# (Cygwin is a UNIX-like environment that runs on top of Windows)
cygwin=false
case "$(uname)" in
  CYGWIN*) cygwin=true;;
esac

# Enter posix mode for bash
set -o posix

# Resolve SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Usage text
export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation"

# SPARK-4161: scala does not assume use of the java classpath,
# so we need to add the "-Dscala.usejavacp=true" flag manually. We
# do this specifically for the Spark shell because the scala REPL
# has its own class loader, and any additional classpath specified
# through spark.driver.extraClassPath is not automatically propagated.
# Force use of the java classpath, because the Scala REPL does not pick it up on its own
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"

# Launch spark-submit in a platform-specific way:
# under Cygwin, tweak the terminal settings first, run spark-submit with org.apache.spark.repl.Main, then restore them;
# on Linux and other systems, just export SPARK_SUBMIT_OPTS and call spark-submit.
function main() {
  if $cygwin; then
    # JLine is a Java library for handling console input
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).

    # stty changes terminal settings:
    # -icanon disables canonical mode so input is handled character by character,
    # "min 1" (together with -icanon) makes each read return after at least one character,
    # -echo turns off input echoing (as when typing passwords).
    # > /dev/null 2>&1 sends both stdout and stderr to /dev/null.
    # On Windows/Cygwin, also tell JLine to use a unix-style terminal.
    # Started without extra arguments, the effective command is:
    # bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell"
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else

    # On Linux and other systems, export the submit options and run spark-submit
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

# Copy restore-TTY-on-exit functions from Scala script so spark-shell exits properly even in
# binary distribution of Spark where Scala is not installed
# Default the exit status to 127 ("command not found")
# and start with no saved stty settings
exit_status=127
saved_stty=""

# restore stty settings (echo in particular)
# Re-apply the saved terminal settings (in particular, re-enable echo) and clear the variable
function restoreSttySettings() {
  stty $saved_stty
  saved_stty=""
}

# On exit: if there are saved stty settings, restore them, then exit with the recorded status
function onExit() {
  if [[ "$saved_stty" != "" ]]; then
    restoreSttySettings
  fi
  exit $exit_status
}

# to reenable echo if we are interrupted before completing.
# trap onExit INT: if the shell is interrupted (SIGINT, i.e. Ctrl-C; see kill -l), run onExit
trap onExit INT

# save terminal settings
# stty -g prints all current settings in a form stty can read back
saved_stty=$(stty -g 2>/dev/null)
# clear on error so we don't later try to restore them
# if stty failed, clear saved_stty so we do not try to restore broken settings later
if [[ ! $? ]]; then
  saved_stty=""
fi

# Run main
main "$@"

# record the exit status lest it be overwritten:
# then reenable echo and propagate the code.
exit_status=$?
onExit
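Since everything after the script name is forwarded to spark-submit, the usual spark-submit options work here too; for example, with a hypothetical master URL:

# Local shell with 4 threads
bin/spark-shell --master local[4]

# Shell against the standalone cluster started above
bin/spark-shell --master spark://master01:7077 --executor-memory 2g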

spark-submit

This script submits an application jar to the Spark cluster; the options are whatever the user passes, and we will look at them in detail when we get to the SparkSubmit class itself.

# Resolve SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Hand everything over to spark-class, which runs org.apache.spark.deploy.SparkSubmit with the user's arguments.
# When started via spark-shell, the arguments are: --class org.apache.spark.repl.Main --name "Spark shell"
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

start-history-server.sh

The history server: it simply calls spark-daemon.sh to start org.apache.spark.deploy.history.HistoryServer.

# Starts the history server on the machine this script is executed on.
#
# Usage: start-history-server.sh
#
# Use the SPARK_HISTORY_OPTS environment variable to set history server configuration.
#

# Resolve SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# Load the conf directory, the PySpark paths and the spark-env.sh settings
. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Run spark-daemon.sh with the arguments:
# start org.apache.spark.deploy.history.HistoryServer 1 "$@"
exec "${SPARK_HOME}/sbin"/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1 "$@"

start-shuffle-service.sh

Starts the external shuffle service, again by handing org.apache.spark.deploy.ExternalShuffleService to spark-daemon.sh.

# Starts the external shuffle server on the machine this script is executed on.
#
# Usage: start-shuffle-server.sh
#
# Use the SPARK_SHUFFLE_OPTS environment variable to set shuffle server configuration.
#
# Resolve SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# Load the conf directory, the PySpark paths and the spark-env.sh settings
. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Run spark-daemon.sh with the arguments:
# start org.apache.spark.deploy.ExternalShuffleService 1
exec "${SPARK_HOME}/sbin"/spark-daemon.sh start org.apache.spark.deploy.ExternalShuffleService 1

start-thriftserver.sh

The Spark SQL Thrift server: it runs spark-daemon.sh in submit mode with org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.

# Shell script for starting the Spark SQL Thrift server

# Enter posix mode for bash
set -o posix

# Resolve SPARK_HOME
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# NOTE: This exact class name is matched downstream by SparkSubmit.
# Any changes need to be reflected there.
# The Spark SQL class is HiveThriftServer2
CLASS="org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"

# Usage text
function usage {
  echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
  pattern="usage"
  pattern+="\|Spark assembly has been built with Hive"
  pattern+="\|NOTE: SPARK_PREPEND_CLASSES is set"
  pattern+="\|Spark Command: "
  pattern+="\|======="
  pattern+="\|--help"

  # Print spark-submit's help (filtering its Usage line),
  # then the Thrift server's own options via spark-class
  "${SPARK_HOME}"/bin/spark-submit --help 2>&1 | grep -v Usage 1>&2
  echo
  echo "Thrift server options:"
  "${SPARK_HOME}"/bin/spark-class $CLASS --help 2>&1 | grep -v "$pattern" 1>&2
}

# If the arguments contain --help or -h, print the usage text and exit
if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  usage
  exit 0
fi

# Export the usage function for spark-submit to use
export SUBMIT_USAGE_FUNCTION=usage

# Run spark-daemon.sh with the arguments:
# submit $CLASS 1 --name "Thrift JDBC/ODBC Server" "$@"
exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Thrift JDBC/ODBC Server" "$@"

 
