Contents:
1. Master accepts Driver registration
2. Master accepts Application registration
3. Master accepts Worker registration
4. Master handles Driver state changes
5. Master handles Executor state changes
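I. How the Master handles registration
1. What registers with the Master
The Master accepts registrations from three kinds of objects: Driver, Application, and Worker.
Note: an Executor does not register with the Master. It registers with the SchedulerBackend inside the Driver.
2. Worker registration
A Worker registers with the Master on its own initiative once it has started. So when a new Worker is added to a Spark cluster that is already running in production, the cluster can use it without being restarted, which increases processing capacity.
Source code: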
override def onStart() {
  assert(!registered)
  logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
    host, port, cores, Utils.megabytesToString(memory)))
  logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
  logInfo("Spark home: " + sparkHome)
  createWorkDir()
  shuffleService.startIfEnabled()
  webUi = new WorkerWebUI(this, workDir, webUiPort)
  webUi.bind()
  registerWithMaster()
  metricsSystem.registerSource(workerSource)
  metricsSystem.start()
  // Attach the worker metrics servlet handler to the web ui after the metrics system is started.
  metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
}
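3. The Master's response
When the Master receives a Worker's registration request, it first checks whether it is itself in standby mode; if so, it does not process the request. It then checks whether its in-memory data structure (idToWorker) already holds registration information for that Worker; if so, the Worker is not registered a second time.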
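The two checks above can be sketched as follows. This is only a minimal model and not Spark's own code: MiniMaster, handleRegister and the three result objects are hypothetical names introduced for illustration, and only idToWorker is taken from the notes.

```scala
import scala.collection.mutable

// Possible outcomes of a registration request in this simplified model.
sealed trait RegistrationResult
case object Ignored extends RegistrationResult   // Master is standby: request not processed
case object Duplicate extends RegistrationResult // worker id already present in idToWorker
case object Accepted extends RegistrationResult  // worker recorded

// Hypothetical stand-in for the Master; not the real Spark class.
final class MiniMaster(var standby: Boolean) {
  // Stand-in for the Master's idToWorker map (worker id -> "host:port").
  private val idToWorker = mutable.HashMap.empty[String, String]

  def handleRegister(id: String, address: String): RegistrationResult =
    if (standby) Ignored
    else if (idToWorker.contains(id)) Duplicate
    else {
      idToWorker(id) = address
      Accepted
    }
}
```

In this model a standby Master returns Ignored for every request, an active Master returns Accepted the first time an id is seen, and Duplicate for every later request with the same id.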
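4. Master accepts the Worker registration
If the Master decides to accept the Worker, it first creates a WorkerInfo object to hold the registering Worker's information, as in the following code: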
private[spark] class WorkerInfo(
    val id: String,
    val host: String,
    val port: Int,
    val cores: Int,
    val memory: Int,
    val endpoint: RpcEndpointRef,
    val webUiPort: Int,
    val publicAddress: String)
  extends Serializable
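It then calls registerWorker to perform the actual registration (during registration, entries for Workers that died earlier and are now registering again are filtered out). See the code: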
if (registerWorker(worker)) {
  persistenceEngine.addWorker(worker)
  context.reply(RegisteredWorker(self, masterWebUiUrl))
  schedule()
} else {
  val workerAddress = worker.endpoint.address
  logWarning("Worker registration failed. Attempted to re-register worker at same " +
    "address: " + workerAddress)
  context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
    + workerAddress))
}
private def registerWorker(worker: WorkerInfo): Boolean = {
  // There may be one or more refs to dead workers on this same node (w/ different ID's),
  // remove them.
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }
  // (The rest of the method is omitted in these notes.)
}
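Inside registerWorker, entries whose state is DEAD are filtered out directly, and for an entry in the UNKNOWN state removeWorker is called to clean it up (including the Executors and Drivers that belong to that Worker).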
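The DEAD and UNKNOWN handling described above can be modelled in a few lines. This is a sketch under stated assumptions and not the Spark implementation: MiniWorker, MiniRegistry and this local WorkerState enumeration are hypothetical stand-ins, addresses are plain "host:port" strings, and the cleanup of Executors and Drivers is left out.

```scala
import scala.collection.mutable

// Local stand-in for Spark's WorkerState; only the states used here.
object WorkerState extends Enumeration {
  val ALIVE, DEAD, UNKNOWN = Value
}

// Hypothetical, stripped-down stand-in for WorkerInfo (identity-based equality).
final class MiniWorker(val id: String, val host: String, val port: Int,
                       var state: WorkerState.Value) {
  def address: String = s"$host:$port"
}

final class MiniRegistry {
  val workers = mutable.HashSet.empty[MiniWorker]
  val addressToWorker = mutable.HashMap.empty[String, MiniWorker]

  // Marks a worker DEAD and forgets its address. The entry stays in `workers`
  // until a later registration from the same host and port filters it out.
  def removeWorker(w: MiniWorker): Unit = {
    w.state = WorkerState.DEAD
    addressToWorker -= w.address
  }

  def registerWorker(worker: MiniWorker): Boolean = {
    // Drop DEAD entries recorded for the same host and port.
    val dead = workers.filter { w =>
      w.host == worker.host && w.port == worker.port && w.state == WorkerState.DEAD
    }
    dead.foreach(workers -= _)

    addressToWorker.get(worker.address) match {
      // A worker that is not UNKNOWN already holds this address: reject.
      case Some(old) if old.state != WorkerState.UNKNOWN => false
      // Either no entry, or an UNKNOWN one that is cleaned up first.
      case maybeOld =>
        maybeOld.foreach(removeWorker)
        workers += worker
        addressToWorker(worker.address) = worker
        true
    }
  }
}
```

In this model, a second registration from the same address is rejected while the first Worker is ALIVE, and accepted once the first Worker has been marked UNKNOWN or removed.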
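5. Registration order
During registration, the Driver is registered first, and then the Application.
II. Master state management
1. Driver state changes
Driver states fall into two groups: ERROR, FINISHED, KILLED and FAILED on one side, and all other states on the other.
When the Driver's state is ERROR, FINISHED, KILLED or FAILED, the Master removes the Driver; for any other state it throws an exception. The source code is as follows: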
case DriverStateChanged(driverId, state, exception) => {
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
}
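2. Executor state changes
After receiving an Executor's state, the Master looks up whether that Executor exists. When an Executor dies, the system tries to restart it a limited number of times (at most 10); if more than 10 restarts are needed, the Master removes the Application.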
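The retry accounting described above can be sketched as a simple counter. This is a hypothetical model rather than Spark's code: MiniApp, onExecutorFailed and maxRetries are names made up for illustration, the limit of 10 is taken from the notes, and the exact boundary condition in the Spark source may differ.

```scala
// Hypothetical model of the per-application retry accounting.
final class MiniApp(val id: String, val maxRetries: Int = 10) {
  private var retryCount = 0
  var removed = false

  // Called when one of the application's executors dies.
  // Returns true if a replacement executor should be scheduled,
  // false if the application has been removed instead.
  def onExecutorFailed(): Boolean = {
    retryCount += 1
    if (retryCount <= maxRetries) true // still within the allowed restarts
    else {
      removed = true // assumed: more failures than maxRetries removes the application
      false
    }
  }
}
```

With the default limit, the first ten failures each return true and the eleventh returns false and marks the application as removed.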
III. Execution flow diagram
Note: these notes are based on the DT Big Data IMF course.