The ClientEndpoint sends a RegisterApplication request, and the Master replies with a RegisteredApplication message indicating success; at this point application registration is complete. The next step is launching the Executors, and schedule() is the entry point for that:
private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Randomly shuffle all alive workers so that drivers are not piled onto a
  // single worker. Note that a worker registers with the Master after it
  // starts; once registered, the Master holds the worker's endpoint ref and
  // can communicate with it.
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    var launched = false
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Launch Executors on the workers
  startExecutorsOnWorkers()
}
The last line of schedule() is the concrete entry point for launching Executors, startExecutorsOnWorkers(). The call chain is
startExecutorsOnWorkers() -> allocateWorkerResourceToExecutors() -> launchExecutor(). We focus on launchExecutor() here; the earlier calls in the chain can be skipped for now.
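For context on the skipped methods: startExecutorsOnWorkers decides how many cores each usable worker contributes to an application, by default spreading the cores across workers round-robin (controlled by spark.deploy.spreadOut). The sketch below is an illustrative simplification of that allocation idea, not the actual Spark code; the function name and signature are made up for this example:

```scala
// Illustrative sketch only: round-robin "spread out" core allocation.
// Given the cores an app still needs and each usable worker's free cores,
// return how many cores to take from each worker.
def spreadOutCores(coresNeeded: Int, freeCores: Array[Int]): Array[Int] = {
  val assigned = Array.fill(freeCores.length)(0)
  var remaining = coresNeeded
  var pos = 0
  var visitedWithoutAssign = 0 // consecutive workers visited with nothing free
  while (remaining > 0 && visitedWithoutAssign < freeCores.length) {
    if (assigned(pos) < freeCores(pos)) {
      assigned(pos) += 1 // one core at a time, one worker at a time
      remaining -= 1
      visitedWithoutAssign = 0
    } else {
      visitedWithoutAssign += 1
    }
    pos = (pos + 1) % freeCores.length
  }
  assigned
}

// e.g. spreadOutCores(5, Array(4, 4, 4)) yields Array(2, 2, 1)
```

With spreadOut disabled, the allocation instead fills up each worker before moving to the next; spreading out tends to be better for data locality across the cluster.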
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  // Send a LaunchExecutor message to the worker
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Tell the driver (the ClientEndpoint) that an executor has been added
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
In launchExecutor, the Master first issues the request with worker.endpoint.send(LaunchExecutor(...)). When the worker receives it, it first creates a working directory for the executor:
val executorDir = new File(workDir, appId + "/" + execId)
It then creates an ExecutorRunner, calls its start() method, and sends ExecutorStateChanged messages to both the worker and the Master:
val manager = new ExecutorRunner(
  appId,
  execId,
  appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
  cores_,
  memory_,
  self,
  workerId,
  host,
  webUi.boundPort,
  publicAddress,
  sparkHome,
  executorDir,
  workerUri,
  conf,
  appLocalDirs,
  ExecutorState.RUNNING)
executors(appId + "/" + execId) = manager
// Start the Executor; inside start(), an ExecutorStateChanged message
// is sent to the worker
manager.start()
coresUsed += cores_
memoryUsed += memory_
// Send an ExecutorStateChanged message to the Master
sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
First look at ExecutorRunner.start(). The code is very simple: it constructs a thread whose run() method calls fetchAndRunExecutor():
workerThread = new Thread("ExecutorRunner for " + fullId) {
  override def run() { fetchAndRunExecutor() }
}
workerThread.start()
fetchAndRunExecutor mainly uses a ProcessBuilder to assemble the command line that launches the Executor process:
val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
  memory, sparkHome.getAbsolutePath, substituteVariables)
process = builder.start()
// ... later, once the executor process has exited, the runner reports
// the final state back to the worker:
worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
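The ProcessBuilder pattern used here can be illustrated with a minimal, self-contained sketch; the command below is a stand-in for demonstration, not Spark's actual executor command:

```scala
import java.io.File

object ProcessBuilderSketch {
  def main(args: Array[String]): Unit = {
    // Assemble the command as a list of arguments, much like
    // CommandUtils.buildProcessBuilder assembles the executor JVM command
    val command = java.util.Arrays.asList("echo", "executor started")
    val builder = new ProcessBuilder(command)
    builder.directory(new File("."))  // working dir, analogous to executorDir
    builder.redirectErrorStream(true) // merge stderr into stdout
    val process = builder.start()     // spawn the child process
    val exitCode = process.waitFor()  // block until it exits
    println(s"child exited with code $exitCode")
  }
}
```

In the real ExecutorRunner this waiting happens on the dedicated workerThread, so the Worker's message loop is never blocked by a running executor.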
During executor startup, ExecutorStateChanged messages are thus sent to both the worker and the Master. When the worker receives the message, it simply forwards it on to the Master; see handleExecutorStateChanged in Worker:
sendToMaster(executorStateChanged)
The messages sent on the Executor's behalf all eventually reach the Master. On receiving one, the Master sends an ExecutorUpdated message to the Driver, which here means the ClientEndpoint.
When the ClientEndpoint receives the message, it logs the status information and, depending on the Executor's state, decides whether the Executor needs to be removed.
The Master likewise decides, based on the Executor's state, whether the Executor should be removed, and finally calls schedule() again. The code details can be traced by following the flow chart.
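That last hop can be condensed into a rough sketch of the Master's ExecutorStateChanged handler (simplified and not verbatim Spark code; field and message signatures vary across Spark versions):

```scala
// Rough sketch of Master handling ExecutorStateChanged (simplified):
case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption.foreach { exec =>
    exec.state = state
    // Forward the new state to the driver, i.e. the ClientEndpoint
    exec.application.driver.send(
      ExecutorUpdated(execId, state, message, exitStatus))
    if (ExecutorState.isFinished(state)) {
      // A finished executor is removed from its application,
      // and schedule() runs again to reuse the freed resources
      exec.application.removeExecutor(exec)
      schedule()
    }
  }
```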