SparkSubmit: Source Code Analysis of the Submission Flow

1. The Submit Command

In real production environments, Spark jobs are generally submitted in yarn cluster mode, for example:

spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.11-2.3.2.3.1.0.0-78.jar \
10

2. Source Code Analysis

Running the submit command first invokes the $SPARK_HOME/bin/spark-submit script, which contains:

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Looking at bin/spark-class, the script ultimately builds and executes a command of the form /bin/java ... org.apache.spark.deploy.SparkSubmit --master ... --class ..., so the program entry point must be the main method of org.apache.spark.deploy.SparkSubmit. Open the Spark source project in IDEA (Ctrl+Shift+N, or double-tap Shift) and search for "org.apache.spark.deploy.SparkSubmit"; in the Scala sources, go straight to the companion object.

override def main(args: Array[String]): Unit = {
    // 1. First, create an anonymous SparkSubmit instance here
    val submit = new SparkSubmit() {
        self =>
        // Override parseArguments of class SparkSubmit (argument parsing and loading)
        override protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
            new SparkSubmitArguments(args) {
                override protected def logInfo(msg: => String): Unit = self.logInfo(msg)
                override protected def logWarning(msg: => String): Unit = self.logWarning(msg)
                override protected def logError(msg: => String): Unit = self.logError(msg)
            }
        }

        // info log output
        override protected def logInfo(msg: => String): Unit = printMessage(msg)
        // warning output
        override protected def logWarning(msg: => String): Unit = printMessage(s"Warning: $msg")
        // error output
        override protected def logError(msg: => String): Unit = printMessage(s"Error: $msg")
        // 3. Override the submit method to catch exceptions
        override def doSubmit(args: Array[String]): Unit = {
            try {
                // 4. This calls into doSubmit() of class SparkSubmit
                super.doSubmit(args)
            } catch {
                case e: SparkUserAppException =>
                exitFn(e.exitCode)
            }
        }
    }
    // 2. Call doSubmit() on the SparkSubmit instance created above
    submit.doSubmit(args)
}

As shown above, main receives the command-line arguments as args: Array[String] and passes them to doSubmit(args) of the anonymous SparkSubmit instance, which then delegates to doSubmit(args) of the parent class SparkSubmit.
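The self => at the top of the anonymous subclass is a Scala self alias: it lets the nested anonymous SparkSubmitArguments delegate its logging back to the enclosing SparkSubmit instance, where this would otherwise refer to the inner object. A minimal sketch of the pattern (hypothetical classes, not Spark code):

// Minimal sketch of the `self =>` alias pattern (hypothetical classes)
trait Logger {
  def logInfo(msg: => String): Unit
}

class Outer extends Logger {
  self =>  // alias for this Outer instance, visible inside nested anonymous classes

  override def logInfo(msg: => String): Unit = println(s"[outer] $msg")

  // the nested anonymous Logger delegates back to the enclosing instance via `self`
  def buildInner(): Logger = new Logger {
    override def logInfo(msg: => String): Unit = self.logInfo(s"from inner: $msg")
  }
}

object SelfAliasDemo extends App {
  new Outer().buildInner().logInfo("hello")  // prints: [outer] from inner: hello
}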

// 5. doSubmit() in class SparkSubmit
def doSubmit(args: Array[String]): Unit = {
    // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
    // be reset before the application starts.
    // Initialize the logging system and track whether it needs to be reset before the app starts
    val uninitLog: Boolean = initializeLogIfNecessary(isInterpreter = true, silent = true)

    // 6. Call parseArguments() to parse the submitted args and the Spark config files
    val appArgs: SparkSubmitArguments = parseArguments(args)
    // If verbose is enabled, print the parsed arguments
    if (appArgs.verbose) {
        logInfo(appArgs.toString)
    }
    // Match the requested action: submit, kill, request status, or print version.
    // The action was set during parsing and stored as a SparkSubmitAction;
    // if none was given, SparkSubmitArguments defaults it to SparkSubmitAction.SUBMIT.
    // For a submission this enters submit()
    appArgs.action match {
        case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
        case SparkSubmitAction.KILL => kill(appArgs)
        case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
        case SparkSubmitAction.PRINT_VERSION => printVersion()
    }
}

When doSubmit runs, it calls parseArguments(args) to parse the arguments:

/**
  * Argument parsing.
  * This first enters the parseArguments() overridden in object SparkSubmit,
  * which simply creates a SparkSubmitArguments(args) instance.
  * SparkSubmitArguments extends the abstract class SparkSubmitArgumentsParser,
  * which extends SparkSubmitOptionParser -- the same parent class that the
  * OptionParser used by the launcher's Main extends for argument parsing.
  * SparkSubmitArguments declares the full set of parameters needed by the
  * various run modes, so this step parses all submit parameters plus the
  * default Spark configuration.
  */
protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
    new SparkSubmitArguments(args)
}

Clicking into SparkSubmitArguments reveals the call:

// Set parameters from command line arguments
parse(args.asJava)

Clicking into parse shows that the command-line arguments are matched with a regular expression; each recognized option is then handed to handle, and the handle that actually processes the options is the one overridden in SparkSubmitArguments, which assigns each option value to the corresponding field.
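The parse/handle interplay can be reproduced in a few lines. The sketch below is not the real SparkSubmitOptionParser, just the same idea under simplified assumptions: split --name=value pairs with a regex, treat --name value pairs positionally, and let handle assign each recognized option to a field:

import scala.collection.mutable

// Simplified sketch of the parse/handle idea behind SparkSubmitOptionParser /
// SparkSubmitArguments (hypothetical, for illustration only)
class MiniSubmitArguments(args: Array[String]) {
  var master: String = _
  var mainClass: String = _
  var primaryResource: String = _
  val childArgs = mutable.ArrayBuffer[String]()

  private val eqSeparated = "(--[^=]+)=(.+)".r   // matches --name=value

  parse(args.toList)

  private def parse(args: List[String]): Unit = args match {
    case eqSeparated(name, value) :: tail => handle(name, value); parse(tail)
    case name :: value :: tail if name.startsWith("--") => handle(name, value); parse(tail)
    case resource :: tail if primaryResource == null => primaryResource = resource; parse(tail)
    case arg :: tail => childArgs += arg; parse(tail)
    case Nil =>
  }

  // handle assigns each recognized option to the corresponding field
  private def handle(opt: String, value: String): Unit = opt match {
    case "--master" => master = value
    case "--class"  => mainClass = value
    case other      => throw new IllegalArgumentException(s"Unknown option: $other")
  }
}

object MiniSubmitArgumentsDemo extends App {
  val a = new MiniSubmitArguments(Array(
    "--master", "yarn", "--class", "org.apache.spark.examples.SparkPi",
    "./examples/jars/spark-examples.jar", "10"))
  println((a.master, a.mainClass, a.primaryResource, a.childArgs))
  // (yarn,org.apache.spark.examples.SparkPi,./examples/jars/spark-examples.jar,ArrayBuffer(10))
}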
SparkSubmitArguments defines a field named action:

// Declare and initialize the action field
var action: SparkSubmitAction = null

// If action was never set, default it to SUBMIT
action = Option(action).getOrElse(SUBMIT)

// Back in doSubmit:
appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
    ......
}

So the default action is SUBMIT.
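The defaulting is nothing more than wrapping a possibly-null value in Option and falling back with getOrElse. A tiny self-contained sketch with a hypothetical enum:

// Hypothetical action enum, showing how a null action falls back to SUBMIT
object MiniSubmitAction extends Enumeration {
  val SUBMIT, KILL, REQUEST_STATUS, PRINT_VERSION = Value
}

object ActionDefaultDemo extends App {
  var action: MiniSubmitAction.Value = null          // nothing was specified on the command line
  action = Option(action).getOrElse(MiniSubmitAction.SUBMIT)
  println(action)                                    // SUBMIT
}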

@tailrec
private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {

    // doRunMain eventually calls runMain(), which starts the main() of the prepared class.
    // It is only defined here; it is invoked further down after the checks below.
    // #2
    def doRunMain(): Unit = {
        // --proxy-user may be given at submit time; if not, the current user is used
        if (args.proxyUser != null) {
            val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
                                                                 UserGroupInformation.getCurrentUser())
            try {
                proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
                    // The real work happens here: runMain()
                    override def run(): Unit = {
                        runMain(args, uninitLog)
                    }
                })
            } catch {
                case e: Exception =>
                // Hadoop's AuthorizationException suppresses the exception's stack trace, which
                // makes the message printed to the output by the JVM not very helpful. Instead,
                // detect exceptions with empty stack traces here, and treat them differently.
                // Hadoop suppresses the stack trace here, so an empty stack trace marks this case
                if (e.getStackTrace().length == 0) {
                    error(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
                } else {
                    throw e
                }
            }
        } else {
            // #3 No proxy user specified: call runMain() directly
            runMain(args, uninitLog)
        }
    }

    // In standalone cluster mode, there are two submission gateways:
    //   (1) The traditional RPC gateway using o.a.s.deploy.Client as a wrapper
    //   (2) The new REST-based gateway introduced in Spark 1.3
    // The latter is the default behavior as of Spark 1.3, but Spark submit will fail over
    // to use the legacy gateway if the master endpoint turns out to be not a REST server.
    // Standalone cluster mode has two submission gateways: the legacy RPC gateway
    // (wrapped by o.a.s.deploy.Client) and the REST-based gateway, which is the
    // default since Spark 1.3. If the master endpoint turns out not to be a REST
    // server, spark-submit falls back to the legacy gateway.
    // The branch below handles standalone cluster mode with REST enabled
    if (args.isStandaloneCluster && args.useRest) {
        // Try the REST gateway: log a message and call doRunMain()
        try {
            logInfo("Running Spark using the REST application submission protocol.")
            doRunMain()
        } catch {
            // Fail over to use the legacy submission gateway
            // On a REST connection failure, warn, disable REST, and resubmit
            case e: SubmitRestConnectionException =>
            logWarning(s"Master endpoint ${args.master} was not a REST server. " +
                       "Falling back to legacy submission gateway instead.")
            args.useRest = false
            submit(args, uninitLog = false)
        }
        // In all other modes, just run the main class as prepared,
        // i.e. call doRunMain() above for the prepared environment
    } else {
        // #1
        doRunMain()
    }
}
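The proxy-user branch relies on Hadoop's UserGroupInformation API to run the submission under another identity. A hedged sketch of that usage in isolation (the user name and action body are illustrative; actually impersonating a user also requires the cluster's hadoop.proxyuser.* settings):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserDemo {
  // Run `body` as `proxyUser`, impersonated on top of the currently logged-in user
  def runAs(proxyUser: String)(body: => Unit): Unit = {
    val ugi = UserGroupInformation.createProxyUser(proxyUser, UserGroupInformation.getCurrentUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = body   // everything here executes as the proxy user
    })
  }

  def main(args: Array[String]): Unit = {
    // "alice" is a hypothetical user; authorization is enforced by the cluster config
    runAs("alice") {
      println(s"running as: ${UserGroupInformation.getCurrentUser.getShortUserName}")
    }
  }
}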
private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
    // #1 Crucial: prepare the submit environment from the parsed arguments
    val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
    // Let the main class re-initialize the logging system once it starts.
    // Re-initialize logging once the main class starts
    if (uninitLog) {
        Logging.uninitialize()
    }

    if (args.verbose) {
        logInfo(s"Main class:\n$childMainClass")
        logInfo(s"Arguments:\n${childArgs.mkString("\n")}")
        // sysProps may contain sensitive information, so redact before printing
        logInfo(s"Spark config:\n${Utils.redact(sparkConf.getAll.toMap).mkString("\n")}")
        logInfo(s"Classpath elements:\n${childClasspath.mkString("\n")}")
        logInfo("\n")
    }
    // #2 Get the class loader for the submission
    val loader = getSubmitClassLoader(sparkConf)
    for (jar <- childClasspath) {
        addJarToClasspath(jar, loader)
    }

    var mainClass: Class[_] = null

    try {
        // #3 Load the class from its (string) name via reflection
        mainClass = Utils.classForName(childMainClass)
    } catch {
        case e: ClassNotFoundException =>
        logError(s"Failed to load class $childMainClass.")
        if (childMainClass.contains("thriftserver")) {
            logInfo(s"Failed to load main class $childMainClass.")
            logInfo("You need to build Spark with -Phive and -Phive-thriftserver.")
        }
        throw new SparkUserAppException(CLASS_NOT_FOUND_EXIT_STATUS)
        case e: NoClassDefFoundError =>
        logError(s"Failed to load $childMainClass: ${e.getMessage()}")
        if (e.getMessage.contains("org/apache/hadoop/hive")) {
            logInfo(s"Failed to load hive class.")
            logInfo("You need to build Spark with -Phive and -Phive-thriftserver.")
        }
        throw new SparkUserAppException(CLASS_NOT_FOUND_EXIT_STATUS)
    }

    // #4 If mainClass implements SparkApplication, take the if branch; otherwise the else branch
    val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
        // #5 Instantiate mainClass via its no-arg constructor and cast it to SparkApplication
        mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
    } else {
        // #6 Otherwise wrap mainClass in a JavaMainApplication
        new JavaMainApplication(mainClass)
    }

    @tailrec
    def findCause(t: Throwable): Throwable = t match {
        case e: UndeclaredThrowableException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
        case e: InvocationTargetException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
        case e: Throwable =>
        e
    }

    try {
        // #7
        app.start(childArgs.toArray, sparkConf)
    } catch {
        case t: Throwable =>
        throw findCause(t)
    }
}
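The heart of runMain is a reflection pattern: load a class from its string name, test whether it implements SparkApplication, and either instantiate it directly or wrap it so that its static main can be invoked. A self-contained sketch of the same pattern with hypothetical stand-in classes (assumed to live in the default package):

// Hypothetical stand-ins for SparkApplication / JavaMainApplication, showing the
// classForName + isAssignableFrom + newInstance pattern used in runMain
trait MiniApplication {
  def start(args: Array[String]): Unit
}

class MiniYarnApp extends MiniApplication {
  override def start(args: Array[String]): Unit =
    println(s"yarn app started with ${args.mkString(",")}")
}

// Fallback wrapper: invoke a plain main(Array[String]) via reflection
class MiniJavaMainApplication(klass: Class[_]) extends MiniApplication {
  override def start(args: Array[String]): Unit = {
    val mainMethod = klass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, args)
  }
}

object ReflectiveLaunchDemo extends App {
  val childMainClass = "MiniYarnApp"             // in Spark this comes from prepareSubmitEnvironment
  val mainClass = Class.forName(childMainClass)  // Spark uses Utils.classForName

  val app: MiniApplication =
    if (classOf[MiniApplication].isAssignableFrom(mainClass)) {
      mainClass.getConstructor().newInstance().asInstanceOf[MiniApplication]
    } else {
      new MiniJavaMainApplication(mainClass)
    }

  app.start(Array("10"))
}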

The key value in runMain is childMainClass: it determines mainClass, and everything that follows builds on it. childMainClass is one of the values returned by prepareSubmitEnvironment.

private[deploy] def prepareSubmitEnvironment(
    args: SparkSubmitArguments,
    conf: Option[HadoopConfiguration] = None)
: (Seq[String], Seq[String], SparkConf, String) = {
    ......
    // #1 Declare and initialize childMainClass
    var childMainClass = ""
    ......
    // #2 Branch on the cluster manager / deploy mode; here we follow yarn cluster mode
    if (isYarnCluster) {
        // #3 Reassign childMainClass
        childMainClass = YARN_CLUSTER_SUBMIT_CLASS
        if (args.isPython) {
            childArgs += ("--primary-py-file", args.primaryResource)
            childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
        } else if (args.isR) {
            val mainFile = new Path(args.primaryResource).getName
            childArgs += ("--primary-r-file", mainFile)
            childArgs += ("--class", "org.apache.spark.deploy.RRunner")
        } else {
            if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
                childArgs += ("--jar", args.primaryResource)
            }
            childArgs += ("--class", args.mainClass)
        }
        if (args.childArgs != null) {
            args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
        }
    }
    ......
    // #4 Return the results
    (childArgs.toSeq, childClasspath.toSeq, sparkConf, childMainClass)
}

Tracing the code further gives the value of childMainClass:

private[deploy] val YARN_CLUSTER_SUBMIT_CLASS = "org.apache.spark.deploy.yarn.YarnClusterApplication"

// i.e. the variable is assigned:
childMainClass = "org.apache.spark.deploy.yarn.YarnClusterApplication"
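For the SparkPi submission at the top of this article (yarn cluster mode), this means childMainClass becomes "org.apache.spark.deploy.yarn.YarnClusterApplication", while childArgs roughly ends up as --jar ./examples/jars/spark-examples_2.11-2.3.2.3.1.0.0-78.jar, --class org.apache.spark.examples.SparkPi, --arg 10 (an approximation for illustration; the exact values depend on the submit arguments).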

At first the source of org.apache.spark.deploy.yarn.YarnClusterApplication cannot be found in the project. The fix is to add the spark-yarn dependency to the spark-core module's pom.xml:

<!-- Import the spark-yarn dependency so its sources can be browsed -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_${scala.binary.version}</artifactId>
    <version>3.1.0</version>
</dependency>


// YarnClusterApplication extends SparkApplication, which matches the
// isAssignableFrom branch (#4) in runMain above
private[spark] class YarnClusterApplication extends SparkApplication {

    override def start(args: Array[String], conf: SparkConf): Unit = {
        // SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
        // so remove them from sparkConf here for yarn mode.
        conf.remove(JARS)
        conf.remove(FILES)
        // #1 The key object is Client
        new Client(new ClientArguments(args), conf, null).run()
    }
}

// The corresponding branch in runMain:
// #4 If mainClass implements SparkApplication, take the if branch; otherwise the else branch
val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
    // #5 Instantiate mainClass via its no-arg constructor and cast it to SparkApplication
    mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
} else {
    // #6 Otherwise wrap mainClass in a JavaMainApplication
    new JavaMainApplication(mainClass)
}
......
try {
    // #7 start
    app.start(childArgs.toArray, sparkConf)
} catch {
    case t: Throwable =>
    throw findCause(t)
}

Inside YarnClusterApplication, the argument parsing is done by ClientArguments, the class responsible for interpreting these arguments.

The most important part is Client:

// #1 Client.scala
private val yarnClient = YarnClient.createYarnClient

// #2 YarnClient.java
@Public
public static YarnClient createYarnClient() {
    YarnClient client = new YarnClientImpl();
    return client;
}

// #3 YarnClientImpl.java
protected ApplicationClientProtocol rmClient;

The run() method of YarnClusterApplication:

new Client(new ClientArguments(args), conf, null).run()

// run() calls submitApplication(), which returns the appId
def run(): Unit = {                  
    this.appId = submitApplication()
    ......
}
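submitApplication follows the standard YARN client-side flow: create an application via YarnClient, fill in its ApplicationSubmissionContext (whose ContainerLaunchContext carries the amClass command shown below), and submit it. A hedged skeleton of that generic YarnClient flow, not Spark's actual Client code (the echo command stands in for the real AM launch command):

import java.util.{Arrays, HashMap}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, LocalResource, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Generic YARN submission skeleton (illustrative; Spark's Client.submitApplication also
// uploads local resources, builds the environment, and sets up security tokens)
object MiniYarnSubmit {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration(new Configuration()))
    yarnClient.start()

    // Ask the ResourceManager for a new application and get its submission context
    val newApp = yarnClient.createApplication()
    val appContext = newApp.getApplicationSubmissionContext
    appContext.setApplicationName("mini-submit-demo")

    // The AM container spec: in Spark, createContainerLaunchContext puts the
    // "java ... org.apache.spark.deploy.yarn.ApplicationMaster" command here
    val amContainer = ContainerLaunchContext.newInstance(
      new HashMap[String, LocalResource](),        // local resources
      new HashMap[String, String](),               // environment
      Arrays.asList("echo launching the AM here"), // commands (placeholder)
      null, null, null)                            // service data, tokens, ACLs
    appContext.setAMContainerSpec(amContainer)
    appContext.setResource(Resource.newInstance(1024, 1)) // memory (MiB) / vcores for the AM

    val appId = yarnClient.submitApplication(appContext)  // returns the ApplicationId
    println(s"submitted $appId")
    yarnClient.stop()
  }
}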

// Tracing further into the source:

submitApplication => createContainerLaunchContext => sets amClass:
amClass = "org.apache.spark.deploy.yarn.ApplicationMaster"   (cluster mode)
amClass = "org.apache.spark.deploy.yarn.ExecutorLauncher"    (client mode)

Double-tap Shift and search for org.apache.spark.deploy.yarn.ApplicationMaster. In its main method, ApplicationMasterArguments is used to parse the arguments:

class ApplicationMasterArguments(val args: Array[String]) {
    ...
    parseArgs(args.toList)
    ...
    private def parseArgs(inputArgs: List[String]): Unit = {
        ...
        // Parse the arguments via pattern matching
        case ("--jar") :: value :: tail =>
          userJar = value
          args = tail

        case ("--class") :: value :: tail =>
          userClass = value
          args = tail
        ...
    }
}
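This option :: value :: tail style of parsing is easy to reproduce on its own. A minimal sketch of the technique with hypothetical fields (the real ApplicationMasterArguments loops with a mutable args variable instead of recursion, but the matching is the same):

// Minimal sketch of the `opt :: value :: tail` parsing style used by
// ApplicationMasterArguments / ClientArguments (hypothetical fields)
object ConsPatternParseDemo extends App {
  var userJar: String = _
  var userClass: String = _
  var userArgs = List.empty[String]

  @annotation.tailrec
  def parseArgs(inputArgs: List[String]): Unit = inputArgs match {
    case "--jar" :: value :: tail   => userJar = value;    parseArgs(tail)
    case "--class" :: value :: tail => userClass = value;  parseArgs(tail)
    case "--arg" :: value :: tail   => userArgs :+= value; parseArgs(tail)
    case Nil                        => // done
    case unknown :: _               => throw new IllegalArgumentException(s"Unknown option: $unknown")
  }

  parseArgs(List("--jar", "app.jar", "--class", "org.apache.spark.examples.SparkPi", "--arg", "10"))
  println((userJar, userClass, userArgs))  // (app.jar,org.apache.spark.examples.SparkPi,List(10))
}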

Still in ApplicationMaster's main method, an ApplicationMaster instance is then created:

val yarnConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
master = new ApplicationMaster(amArgs, sparkConf, yarnConf)

ApplicationMaster creates a YARN ResourceManager client:

private val client = new YarnRMClient()

YarnRMClient wraps an AMRMClient, i.e. the client the ApplicationMaster uses to talk to the ResourceManager:

private[spark] class YarnRMClient extends Logging {
  	private var amClient: AMRMClient[ContainerRequest] = _
    ...
}

At the end of ApplicationMaster's main method, the run method is invoked:

ugi.doAs(new PrivilegedExceptionAction[Unit]() {
    override def run(): Unit = System.exit(master.run())
})

In run, cluster mode calls runDriver:

final def run(): Int = {
    ...
    if (isClusterMode) {
        runDriver()
    } else {
        runExecutorLauncher()
    }
    ...
}

runDriver calls startUserApplication to start the user application:

private def runDriver(): Unit = {
    addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
    // Start the user application
    userClassThread = startUserApplication()
    ...
    val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
                Duration(totalWaitTime, TimeUnit.MILLISECONDS))
    ...
}

In startUserApplication, a class loader loads args.userClass and looks up its main method; a new thread is then created, named "Driver", and started:

private def startUserApplication(): Thread = {
    ...
    val mainMethod = userClassLoader.loadClass(args.userClass)
            .getMethod("main", classOf[Array[String]])
    val userThread = new Thread {
        override def run(): Unit = {
        	......
        }
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread
}

// Tracing the code shows that args.userClass is the class captured in ApplicationMasterArguments by matching --class, i.e. the class specified with --class when the job was submitted.
case ("--class") :: value :: tail =>
          userClass = value
          args = tail

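In isolation, what startUserApplication does is: load the user class through a class loader, look up its static main(Array[String]) method, and invoke it on a new thread named "Driver". A self-contained sketch with a hypothetical user class (a Scala object compiles to a class with a static main forwarder, so loading it by name works):

// Hypothetical "user application" to be launched reflectively
object MiniUserApp {
  def main(args: Array[String]): Unit =
    println(s"user code running with args: ${args.mkString(",")}")
}

object DriverThreadDemo extends App {
  val userClass = "MiniUserApp"   // in Spark this is args.userClass, i.e. the --class value

  // Load the class and resolve its static main(String[]) method
  val mainMethod = Thread.currentThread().getContextClassLoader
    .loadClass(userClass)
    .getMethod("main", classOf[Array[String]])

  val userThread = new Thread {
    override def run(): Unit = mainMethod.invoke(null, Array("10"))
  }
  userThread.setName("Driver")    // this is why the user code's thread shows up as "Driver"
  userThread.start()
  userThread.join()
}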

The Driver thread has been started; runDriver then continues, registering the ApplicationMaster and creating the resource allocator:

private def runDriver(): Unit = {
    addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
    // Start the user application
    userClassThread = startUserApplication()
    ...
    val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
                Duration(totalWaitTime, TimeUnit.MILLISECONDS))
    ...
    // RPC communication environment
    val rpcEnv = sc.env.rpcEnv
    ...
    // Register the AM and request resources
    registerAM(host, port, userConf, sc.ui.map(_.webUrl), appAttemptId)
    ...
    // Create the allocator
    createAllocator(driverRef, userConf, rpcEnv, appAttemptId, distCacheConf)
}

Inside runDriver, createAllocator creates the resource allocator:

private def createAllocator(
        driverRef: RpcEndpointRef,
        _sparkConf: SparkConf,
        rpcEnv: RpcEnv,
        appAttemptId: ApplicationAttemptId,
        distCacheConf: SparkConf): Unit = {
    ...
    // Here, client is the YarnRMClient
    allocator = client.createAllocator(
            yarnConf,
            _sparkConf,
            appAttemptId,
            driverUrl,
            driverRef,
            securityMgr,
            localResources)
    ...
    // Ask YARN for resources through the allocator
    allocator.allocateResources()
    
}
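Underneath YarnRMClient and the allocator sits Hadoop's AMRMClient, the API an ApplicationMaster uses to register with the ResourceManager and to request and receive containers. A hedged sketch of the bare AMRMClient lifecycle (illustrative only; Spark's YarnAllocator adds locality preferences, blacklisting and resource profiles on top):

import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Bare AM <-> RM interaction (illustrative only)
object MiniAMRMClientDemo {
  def main(args: Array[String]): Unit = {
    val amrmClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amrmClient.init(new YarnConfiguration())
    amrmClient.start()

    // Register this ApplicationMaster with the ResourceManager
    amrmClient.registerApplicationMaster("am-host", 0, "")

    // Ask for one container with 1 GiB of memory and 1 vcore
    val capability = Resource.newInstance(1024, 1)
    amrmClient.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(1)))

    // allocate() doubles as the heartbeat; the response carries any granted containers
    val response = amrmClient.allocate(0.1f)
    println(s"allocated containers: ${response.getAllocatedContainers.size()}")

    amrmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
    amrmClient.stop()
  }
}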


def allocateResources(): Unit = synchronized {
    updateResourceRequests()

    val progressIndicator = 0.1f
    // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
    // requests.
    val allocateResponse = amClient.allocate(progressIndicator)

    // Containers that have been allocated
    val allocatedContainers = allocateResponse.getAllocatedContainers()
    allocatorBlacklistTracker.setNumClusterNodes(allocateResponse.getNumClusterNodes)
    // If any containers were allocated, there are resources available to hand out
    if (allocatedContainers.size > 0) {
        logDebug(("Allocated containers: %d. Current executor count: %d. " +
                  "Launching executor count: %d. Cluster resources: %s.")
                 .format(
                     allocatedContainers.size,
                     getNumExecutorsRunning,
                     getNumExecutorsStarting,
                     allocateResponse.getAvailableResources))
        // Handle the allocated containers
        handleAllocatedContainers(allocatedContainers.asScala.toSeq)
    }
    ...
}
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
    // Classify the allocated containers, e.g. same host, same rack;
    // preferred locations let tasks be sent to the most suitable container
    ......
    // Run the allocated containers
    runAllocatedContainers(containersToUse)
}

runAllocatedContainers launches an Executor on each usable container through a thread pool, ultimately calling run:

private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = synchronized {
    for (container <- containersToUse) {
        if (rpRunningExecs < getOrUpdateTargetNumExecutorsForRPId(rpId)) {
            getOrUpdateNumExecutorsStartingForRPId(rpId).incrementAndGet()
            if (launchContainers) {
                // Thread pool
                launcherPool.execute(() => {
                    try {
                        // Start an Executor
                        new ExecutorRunnable(
                            Some(container),
                            conf,
                            sparkConf,
                            driverUrl,
                            executorId,
                            executorHostname,
                            containerMem,
                            containerCores,
                            appAttemptId.getApplicationId.toString,
                            securityMgr,
                            localResources,
                            rp.id
                        ).run()    // run
                        updateInternalState()
                    } catch {
                        ...
                    }
                })
            }
        }
    }
}
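launcherPool is a cached daemon thread pool, so each container launch (an RPC call to a NodeManager) runs off the allocation thread. The idea in isolation, using plain java.util.concurrent instead of Spark's ThreadUtils helper:

import java.util.concurrent.Executors

// Launching each allocated container on a worker thread, in isolation
object ContainerLauncherPoolDemo extends App {
  val launcherPool = Executors.newCachedThreadPool()

  val containersToUse = Seq("container_01", "container_02", "container_03")
  containersToUse.foreach { container =>
    launcherPool.execute(() => {
      // In Spark this is where ExecutorRunnable(...).run() talks to the NodeManager
      println(s"${Thread.currentThread().getName} launching executor in $container")
    })
  }

  launcherPool.shutdown()
}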

The run method:

def run(): Unit = {
    logDebug("Starting Executor Container")
    // Create a client connected to the NodeManager
    nmClient = NMClient.createNMClient()
    nmClient.init(conf)
    nmClient.start()
    // Start the Container
    startContainer()
}

startContainer starts the Container:

def startContainer(): java.util.Map[String, ByteBuffer] = {
    ...
    // Environment info; prepareCommand assembles the launch command
    val commands = prepareCommand()
    ctx.setCommands(commands.asJava)
    ...
    try {
        // Ask a NodeManager to start the Container
        nmClient.startContainer(container.get, ctx)
    } catch {
        ...
    }
}

prepareCommand assembles the launch command, essentially /bin/java ... org.apache.spark.executor.YarnCoarseGrainedExecutorBackend:

private def prepareCommand(): List[String] = {
    ...
    val commands = prefixEnv ++
    Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
    javaOpts ++
    Seq("org.apache.spark.executor.YarnCoarseGrainedExecutorBackend",
        "--driver-url", masterAddress,
        "--executor-id", executorId,
        "--hostname", hostname,
        "--cores", executorCores.toString,
        "--app-id", appId,
        "--resourceProfileId", resourceProfileId.toString) ++
    userClassPath ++
    Seq(
        s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
        s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")
    ...
}


Double-tap Shift (or Ctrl+Shift+N) and search for org.apache.spark.executor.YarnCoarseGrainedExecutorBackend to find its main method:

main then calls the run method of CoarseGrainedExecutorBackend:

private[spark] object YarnCoarseGrainedExecutorBackend extends Logging {
    def main(args: Array[String]): Unit = {
        ...
        CoarseGrainedExecutorBackend.run(backendArgs, createFn)
        System.exit(0)
    }
}

run creates the Executor's runtime environment and registers an RPC endpoint named "Executor":

def run(
    arguments: Arguments,
    backendCreateFn: (RpcEnv, Arguments, SparkEnv, ResourceProfile) =>
    CoarseGrainedExecutorBackend): Unit = {
    ...
    driverConf.set(EXECUTOR_ID, arguments.executorId)
    // The Executor runtime environment
    val env = SparkEnv.createExecutorEnv(driverConf, arguments.executorId, arguments.bindAddress,
                                         arguments.hostname, arguments.cores, cfg.ioEncryptionKey,
                                         isLocal = false)
    // Endpoint
    // backendCreateFn builds the backend object that is registered as the endpoint
    env.rpcEnv.setupEndpoint("Executor",
                             backendCreateFn(env.rpcEnv, arguments, env, cfg.resourceProfile))
    arguments.workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
    }
    env.rpcEnv.awaitTermination()
}

Following setupEndpoint leads to the abstract method:

def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef

Use Ctrl+Alt+B to jump to the implementation of this abstract method:

override def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef = {
    // Register the RPC endpoint
    dispatcher.registerRpcEndpoint(name, endpoint)
}
def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
    // Endpoint address
    val addr = RpcEndpointAddress(nettyEnv.address, name)
    // Endpoint reference
    val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
    synchronized {
        if (stopped) {
            throw new IllegalStateException("RpcEnv has been stopped")
        }
        if (endpoints.containsKey(name)) {
            throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
        }

        endpointRefs.put(endpoint, endpointRef)
        // Message loop
        var messageLoop: MessageLoop = null
        try {
            messageLoop = endpoint match {
                case e: IsolatedRpcEndpoint =>
                new DedicatedMessageLoop(name, e, this)
                case _ =>
                sharedLoop.register(name, endpoint)
                sharedLoop
            }
            endpoints.put(name, messageLoop)
        } catch {
            case NonFatal(e) =>
            endpointRefs.remove(endpoint)
            throw e
        }
    }
    endpointRef
}

DedicatedMessageLoop creates an inbox and a threadpool:

private class DedicatedMessageLoop(
                                      name: String,
                                      endpoint: IsolatedRpcEndpoint,
                                      dispatcher: Dispatcher)
    extends MessageLoop(dispatcher) {

    private val inbox = new Inbox(name, endpoint)

    override protected val threadpool = if (endpoint.threadCount() > 1) {
        ThreadUtils.newDaemonCachedThreadPool(s"dispatcher-$name", endpoint.threadCount())
    } else {
        ThreadUtils.newDaemonSingleThreadExecutor(s"dispatcher-$name")
    }
}

Following the Inbox code:

private[netty] class Inbox(val endpointName: String, val endpoint: RpcEndpoint) extends Logging {
    inbox =>
    @GuardedBy("this")
    protected val messages = new java.util.LinkedList[InboxMessage]()
    ...
    // An OnStart message is posted to itself
    inbox.synchronized {
        messages.add(OnStart)
    }
    ...
}
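Conceptually, each endpoint's Inbox is a queue of messages with OnStart enqueued first, drained by a message loop that dispatches to the endpoint. A heavily simplified sketch of that dispatch idea (hypothetical classes, not the real NettyRpcEnv code):

import java.util.concurrent.{Executors, LinkedBlockingQueue}

// Very simplified inbox + message loop: OnStart is delivered before anything else
object MiniInboxDemo extends App {
  sealed trait InboxMessage
  case object OnStart extends InboxMessage
  case class RpcMessage(body: Any) extends InboxMessage

  trait Endpoint {
    def onStart(): Unit
    def receive(body: Any): Unit
  }

  class Inbox(endpoint: Endpoint) {
    private val messages = new LinkedBlockingQueue[InboxMessage]()
    messages.put(OnStart)                       // posted to itself on registration

    def post(msg: InboxMessage): Unit = messages.put(msg)

    def process(): Unit = messages.take() match {
      case OnStart          => endpoint.onStart()
      case RpcMessage(body) => endpoint.receive(body)
    }
  }

  val inbox = new Inbox(new Endpoint {
    override def onStart(): Unit = println("endpoint started (e.g. register with the driver)")
    override def receive(body: Any): Unit = println(s"received: $body")
  })

  val loop = Executors.newSingleThreadExecutor() // stands in for the dispatcher thread pool
  loop.execute(() => { inbox.process(); inbox.process() })
  inbox.post(RpcMessage("RegisteredExecutor"))
  loop.shutdown()
}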

Processing the OnStart message triggers the onStart method of CoarseGrainedExecutorBackend:

override def onStart(): Unit = {
    ...
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
        // Obtain a reference to the Driver
        driver = Some(ref)
        // Send a RegisterExecutor message to the Driver to register
        ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls,
                                          extractAttributes, _resources, resourceProfile.id))
    }(ThreadUtils.sameThread).onComplete {
        case Success(_) =>
        self.send(RegisteredExecutor)
        case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
}

The Driver must have a corresponding receiving side for this request. Double-tap Shift and search for SparkContext:

// The scheduler backend used for communication
private var _schedulerBackend: SchedulerBackend = _

Following SchedulerBackend:

private[spark] trait SchedulerBackend {
    ...
}

Ctrl+Alt+B to jump to the implementation:

class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
    extends ExecutorAllocationClient with SchedulerBackend with Logging {
    ...
    class DriverEndpoint extends IsolatedRpcEndpoint with Logging {
        override def onStart(): Unit = {
            // onStart method
            ...
        }
        ...
        override def receive: PartialFunction[Any, Unit] = {
            // Handles received messages
            ...
        }
        ...
        override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
            // Handles messages that expect a reply; RegisterExecutor is matched here
            case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls,
            attributes, resources, resourceProfileId) =>
                context.reply(true) // after a series of steps, reply that registration succeeded
            ...
        }
    }
    ...
}

Back in the onStart method of CoarseGrainedExecutorBackend: if the registration reply is successful, the executor sends a message to itself:

override def onStart(): Unit = {
    ...
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
        // Obtain a reference to the Driver
        driver = Some(ref)
        // Send a RegisterExecutor message to the Driver to register
        ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls,
                                          extractAttributes, _resources, resourceProfile.id))
    }(ThreadUtils.sameThread).onComplete {
        case Success(_) =>
        // On success, send itself a message indicating registration succeeded
        self.send(RegisteredExecutor)
        case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
}

After successful registration, the message is received and handled in receive:

override def receive: PartialFunction[Any, Unit] = {
    // Pattern-match the registration-success message
    case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    try {
        // Instantiate the Executor
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false,
                                resources = _resources)
        driver.get.send(LaunchedExecutor(executorId))
    } catch {
        case NonFatal(e) =>
        exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
    }
    ...
}

At this point the resource and environment preparation driven by SparkSubmit is complete, but the Driver continues running:

private def runDriver(): Unit = {
    addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
    // Start the user application
    userClassThread = startUserApplication()
	...
    val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
    try {
        val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
                                         Duration(totalWaitTime, TimeUnit.MILLISECONDS))
        if (sc != null) {
            val rpcEnv = sc.env.rpcEnv

            val userConf = sc.getConf
            val host = userConf.get(DRIVER_HOST_ADDRESS)
            val port = userConf.get(DRIVER_PORT)
            // Register the AM and request resources
            registerAM(host, port, userConf, sc.ui.map(_.webUrl), appAttemptId)

            val driverRef = rpcEnv.setupEndpointRef(
                RpcAddress(host, port),
                YarnSchedulerBackend.ENDPOINT_NAME)
            // Create the allocator
            createAllocator(driverRef, userConf, rpcEnv, appAttemptId, distCacheConf)
        } else {
            throw new IllegalStateException("User did not initialize spark context!")
        }
        // The Driver continues running
        resumeDriver()
        userClassThread.join()
    } catch {
        ...
    } finally {
        resumeDriver()
    }
}

SparkContext

// Signals that the preparation work is done
_taskScheduler.postStartHook()

// Tracing the code:
def postStartHook(): Unit = { }

// Ctrl+Alt+B to the overriding implementation
override def postStartHook(): Unit = {
    waitBackendReady()
}

// Following waitBackendReady: it loops waiting for the backend to become ready, after which the Driver program can continue
private def waitBackendReady(): Unit = {
    if (backend.isReady) {
        return
    }
    while (!backend.isReady) {
        // Might take a while for backend to be ready if it is waiting on resources.
        if (sc.stopped.get) {
            // For example: the master removes the application for some reason
            throw new IllegalStateException("Spark context stopped while waiting for backend")
        }
        synchronized {
            this.wait(100)
        }
    }
}

With the Driver allowed to continue, the program proceeds to execute the rest of the user's application logic, such as a WordCount job.

3. Glossary

RpcEnv: the RPC communication environment

Backend: the scheduler backend

Endpoint: an RPC communication endpoint
