I. Worker Registration
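1. After a Worker starts, it registers itself with the Master.
2. The Master handles the Worker's registration message in its receive method.
3. If the Master is in the STANDBY state, it replies with MasterInStandby and the flow ends. Otherwise, continue with the steps below.
4. If the Worker's id is already present in idToWorker, the Master replies with a registration failure, "Duplicate worker ID", and the flow ends. Otherwise, continue with the steps below.
5. Register the Worker. If registration fails, the Master replies with RegisterWorkerFailed and the flow ends.
5.1 Remove any Workers in the DEAD state that have the same host and port.
5.2 Check whether a Worker is already registered at the same address. If the old Worker's state is UNKNOWN, remove it and accept the new Worker; otherwise, reject the registration.
5.3 Add the Worker to the in-memory collections (workers, idToWorker, addressToWorker).
6. Persist the Worker's information through the persistence engine, then reply to the Worker with RegisteredWorker.
7. Call schedule().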
// 2
case RegisterWorker(
    id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  if (state == RecoveryState.STANDBY) { // 3
    workerRef.send(MasterInStandby)
  } else if (idToWorker.contains(id)) { // 4
    workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      workerRef, workerWebUiUrl)
    if (registerWorker(worker)) { // 5
      persistenceEngine.addWorker(worker) // 6
      workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
      schedule() // 7
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress))
    }
  }
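
The handler above boils down to a four-way decision. The following is a minimal sketch of that decision as a pure function, useful for reasoning about which reply a Worker receives. It is not Spark code: the Reply types are simplified stand-ins named after the real messages, and decide, knownIds and tryRegister are hypothetical names introduced only for this illustration.

```scala
// Simplified model of the Master's RegisterWorker decision; not Spark's actual classes.
object WorkerRegistrationDecision {
  sealed trait Reply
  case object MasterInStandby extends Reply
  case object RegisteredWorker extends Reply
  final case class RegisterWorkerFailed(message: String) extends Reply

  // `tryRegister` stands in for registerWorker(worker): it returns false when
  // a worker that is not in the UNKNOWN state already occupies the same address.
  def decide(
      isStandby: Boolean,
      knownIds: Set[String],
      workerId: String,
      tryRegister: () => Boolean): Reply = {
    if (isStandby) MasterInStandby                          // step 3
    else if (knownIds.contains(workerId))                   // step 4
      RegisterWorkerFailed("Duplicate worker ID")
    else if (tryRegister()) RegisteredWorker                // steps 5 and 6
    else RegisterWorkerFailed("Attempted to re-register worker at same address")
  }
}
```

Note that the checks are ordered: a standby Master never inspects the id, and the address check inside registerWorker only runs for ids the Master has not seen before.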
private def registerWorker(worker: WorkerInfo): Boolean = {
  // There may be one or more refs to dead workers on this same node (w/ different ID's),
  // remove them. // 5.1
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }

  val workerAddress = worker.endpoint.address
  if (addressToWorker.contains(workerAddress)) { // 5.2
    val oldWorker = addressToWorker(workerAddress)
    if (oldWorker.state == WorkerState.UNKNOWN) {
      // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
      // The old worker must thus be dead, so we will remove it and accept the new worker.
      removeWorker(oldWorker, "Worker replaced by a new worker with same address")
    } else {
      logInfo("Attempted to re-register worker at same address: " + workerAddress)
      return false
    }
  }

  // 5.3
  workers += worker // HashSet holding the WorkerInfo objects
  idToWorker(worker.id) = worker // HashMap: worker id -> WorkerInfo
  addressToWorker(workerAddress) = worker // HashMap: worker address -> WorkerInfo
  true
}
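
To experiment with steps 5.1 to 5.3 outside of Spark, the bookkeeping can be modeled with plain collections. The sketch below is an illustration under stated assumptions, not the Master's implementation: Worker, Registry, markDead, workerAt and recordCount are hypothetical names, Worker is an immutable case class rather than Spark's WorkerInfo, and markDead assumes that a removed worker stays in the workers set as a DEAD record while being dropped from both lookup maps, which is what makes the cleanup in step 5.1 necessary.

```scala
import scala.collection.mutable

object WorkerRegistrySketch {
  sealed trait State
  case object Alive extends State
  case object Dead extends State
  case object Unknown extends State

  // Immutable stand-in for WorkerInfo (hypothetical; the real class has more fields).
  final case class Worker(id: String, host: String, port: Int, state: State = Alive) {
    def address: String = s"$host:$port"
  }

  final class Registry {
    private val workers = mutable.HashSet.empty[Worker]
    private val idToWorker = mutable.HashMap.empty[String, Worker]
    private val addressToWorker = mutable.HashMap.empty[String, Worker]

    // Assumption: removal keeps a DEAD record in `workers` but clears both maps.
    private def markDead(w: Worker): Unit = {
      workers -= w
      workers += w.copy(state = Dead)
      idToWorker -= w.id
      addressToWorker -= w.address
    }

    def register(worker: Worker): Boolean = {
      // 5.1: purge DEAD records left behind at the same host and port.
      val stale = workers.filter(w =>
        w.host == worker.host && w.port == worker.port && w.state == Dead)
      workers --= stale

      // 5.2: an UNKNOWN occupant is replaced; any other occupant blocks registration.
      val blocked = addressToWorker.get(worker.address) match {
        case Some(old) if old.state == Unknown =>
          markDead(old)
          false
        case Some(_) => true
        case None => false
      }

      if (blocked) false
      else {
        // 5.3: record the worker in all three collections.
        workers += worker
        idToWorker(worker.id) = worker
        addressToWorker(worker.address) = worker
        true
      }
    }

    def workerAt(address: String): Option[String] = addressToWorker.get(address).map(_.id)
    def recordCount: Int = workers.size
  }
}
```

Registering a worker in the Unknown state and then a second worker at the same address replaces the first one, while a third registration at that address is rejected because the occupant is now alive.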
II. Application Registration
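1. Once the Driver has started, it runs the user's Application code and initializes SparkContext; the underlying xxxSchedulerBackend sends RegisterApplication to the Master through RpcEnv to register.
2. If the Master is in the STANDBY state, it does nothing (no response is sent) and the flow ends. Otherwise, continue with the steps below.
3. Create the Application, including its creation time, application ID, Driver, the cores it uses, and so on.
4. Register the Application.
4.1 If addressToApp already contains an Application for this appAddress (the Driver's address), registerApplication logs the attempt and returns without adding anything.
4.2 Register the Application's metrics source (app.appSource) with applicationMetricsSystem.
4.3 Add the Application to the in-memory collections (apps, idToApp, endpointToApp, addressToApp).
4.4 Add the Application to waitingApps, the collection of Applications waiting to be scheduled.
5. Persist the Application's information through the persistence engine, then reply to the Driver with RegisteredApplication.
6. Call schedule().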
// 1
case RegisterApplication(description, driver) =>
  // TODO Prevent repeated registrations from some driver
  if (state == RecoveryState.STANDBY) { // 2
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, driver) // 3
    registerApplication(app) // 4
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app) // 5
    driver.send(RegisteredApplication(app.id, self))
    schedule() // 6
  }
private def registerApplication(app: ApplicationInfo): Unit = {
  val appAddress = app.driver.address
  if (addressToApp.contains(appAddress)) { // 4.1
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }

  applicationMetricsSystem.registerSource(app.appSource) // 4.2
  // 4.3
  apps += app // HashSet holding the app objects
  idToApp(app.id) = app // HashMap: app id -> app
  endpointToApp(app.driver) = app // HashMap: driver RpcEndpointRef -> app
  addressToApp(appAddress) = app // HashMap: driver address -> app
  // 4.4
  waitingApps += app
}
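
The same pattern can be tried in isolation for Applications. The following is a minimal sketch, assuming an order-preserving buffer for the waiting queue; App, Registry and waitingIds are hypothetical names, and the metrics registration (4.2) and the endpointToApp map are left out to keep the example self-contained.

```scala
import scala.collection.mutable

object AppRegistrySketch {
  // Hypothetical stand-in for ApplicationInfo: just an id and the driver's address.
  final case class App(id: String, driverAddress: String)

  final class Registry {
    private val apps = mutable.HashSet.empty[App]
    private val idToApp = mutable.HashMap.empty[String, App]
    private val addressToApp = mutable.HashMap.empty[String, App]
    // Assumption: an ArrayBuffer, so apps wait in registration order.
    private val waitingApps = mutable.ArrayBuffer.empty[App]

    def register(app: App): Unit = {
      if (addressToApp.contains(app.driverAddress)) {
        // 4.1: an app from this driver address is already registered; add nothing.
      } else {
        // 4.3: record the app in the in-memory collections.
        apps += app
        idToApp(app.id) = app
        addressToApp(app.driverAddress) = app
        // 4.4: queue the app for scheduling.
        waitingApps += app
      }
    }

    def waitingIds: List[String] = waitingApps.map(_.id).toList
  }
}
```

A second registration from the same driver address leaves the collections unchanged, so only the first App from each address ends up in the waiting queue.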