Spark supports several deployment modes. In each of them, the cluster manager allocates resources for the application, launches Executors on those resources, and the Executors execute the tasks and report task status back to the Driver.
Taking standalone mode as an example, let us analyze what happens when an Executor fails. The runtime structure is shown in the figure below, where dashed lines denote the message-communication paths during normal operation and solid lines denote the exception-handling steps.
(1) In standalone mode, after an application is submitted, the Master in the cluster allocates resources to it and starts an ExecutorRunner on a Worker. The ExecutorRunner then launches a CoarseGrainedExecutorBackend process according to the current deployment mode. Once started, this process sends a RegisterExecutor message to the Driver; if registration succeeds, the CoarseGrainedExecutorBackend creates an Executor internally. The Executor is managed by the ExecutorRunner: when the Executor fails (for example, its hosting CoarseGrainedExecutorBackend process exits abnormally), the ExecutorRunner catches the failure and sends an ExecutorStateChanged message to the Worker.
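The failure-notification path described above can be sketched with a simplified, self-contained model. The ExecutorStateChanged case class below mirrors the field layout of Spark's message of the same name, but FakeWorker and runAndReport are hypothetical illustrations, not Spark APIs:

```scala
import scala.collection.mutable

// Simplified executor states, modeled after Spark's ExecutorState enum.
object ExecutorState extends Enumeration {
  val LAUNCHING, RUNNING, FAILED, EXITED = Value
}

// Message the runner sends to the Worker when an executor's state changes
// (same field layout as Spark's ExecutorStateChanged message).
case class ExecutorStateChanged(
    appId: String,
    execId: Int,
    state: ExecutorState.Value,
    message: Option[String],
    exitStatus: Option[Int])

// Hypothetical stand-in for the Worker endpoint: it just records messages.
class FakeWorker {
  val received = mutable.Buffer[ExecutorStateChanged]()
  def send(msg: ExecutorStateChanged): Unit = received += msg
}

// Sketch of the runner's role: run the executor body, and if it exits
// abnormally, catch the failure and report FAILED to the Worker.
def runAndReport(worker: FakeWorker, appId: String, execId: Int)(body: => Unit): Unit = {
  try {
    body
    worker.send(ExecutorStateChanged(appId, execId, ExecutorState.EXITED, None, Some(0)))
  } catch {
    case e: Exception =>
      worker.send(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
        Some(e.getMessage), Some(1)))
  }
}
```

For example, `runAndReport(worker, "app-1", 0) { throw new RuntimeException("backend exited") }` leaves a FAILED state change in the worker's inbox, which is the shape of the notification the real ExecutorRunner delivers.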
Worker # launchExecutor:
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  ...
  val manager = new ExecutorRunner(
    appId,
    execId,
    appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
    cores_,
    memory_,
    self,
    workerId,
    host,
    webUi.boundPort,
    publicAddress,
    sparkHome,
    executorDir,
    workerUri,
    conf,
    appLocalDirs, ExecutorState.RUNNING)
  executors(appId + "/" + execId) = manager
  manager.start()
  ...
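Note how the Worker keys its executors map by the composite string appId + "/" + execId, so executors from different applications can be tracked in a single map. A minimal sketch of this bookkeeping pattern (the RunnerHandle type and sample ids are illustrative, not Spark's):

```scala
import scala.collection.mutable

// Illustrative stand-in for the ExecutorRunner entries the Worker tracks.
case class RunnerHandle(appId: String, execId: Int)

val executors = mutable.HashMap[String, RunnerHandle]()

// Same composite key the Worker builds: appId + "/" + execId.
def fullId(appId: String, execId: Int): String = appId + "/" + execId

// Register two executors for the same application.
executors(fullId("app-20240101", 0)) = RunnerHandle("app-20240101", 0)
executors(fullId("app-20240101", 1)) = RunnerHandle("app-20240101", 1)

// When an executor fails, the same composite key locates and removes it.
val failed = executors.remove(fullId("app-20240101", 1))
```

The composite key lets the Worker resolve an ExecutorStateChanged message straight to the affected runner without a per-application lookup table.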
ExecutorRunner # start:
private[worker] def start() {
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    override def run() { fetchAndRunExecutor() }
  }
  workerThread.start()
  // Shutdown hook that kills actors on shutdown.
  shutdownHook = ShutdownHookManager.addShutdownHook { () =>
    // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
    // be `ExecutorState.RUNNING`. In this case, we should set &