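How Driver and Executor state changes are handled
Let's first look at how the Master handles a Driver state change.
The source code is as follows: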
case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
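As this shows, when the state is ERROR, FINISHED, KILLED, or FAILED, the Master calls
removeDriver(driverId, state, exception)
Any other state is unexpected and causes an exception to be thrown. removeDriver is a private method; its source is: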
private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      completedDrivers += driver
      persistenceEngine.removeDriver(driver)
      driver.state = finalState
      driver.exception = exception
      driver.worker.foreach(w => w.removeDriver(driver))
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}
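It mainly does the following:
- Looks up the driver for driverId; if found, removes it from the in-memory drivers collection (a HashSet). If not found, it only logs a warning.
- Adds the driver to completedDrivers, first dropping the oldest max(RETAINED_DRIVERS / 10, 1) entries if completedDrivers has already reached RETAINED_DRIVERS
- Uses the persistence engine to remove the driver's persisted information
- Sets the driver's state and exception
- Removes the driver from the worker it was running on
- Calls schedule()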
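The trimming of completedDrivers is easy to misread, so here is a minimal standalone sketch of the same idea. RetainedBufferSketch and retain are hypothetical names used only for illustration and are not part of Spark. The sketch uses ArrayBuffer.remove(0, n) in place of trimStart, and adds a require on the limit that the Master code does not have.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not Spark code: mimics how completedDrivers is kept bounded.
object RetainedBufferSketch {
  // Appends `item` to `buf`. If the buffer has already reached `limit`,
  // the oldest max(limit / 10, 1) entries are dropped first.
  def retain[A](buf: ArrayBuffer[A], item: A, limit: Int): Unit = {
    require(limit > 0, "limit must be positive") // assumption added for this sketch
    if (buf.size >= limit) {
      val toRemove = math.max(limit / 10, 1)
      buf.remove(0, toRemove) // drop from the front, i.e. the oldest entries
    }
    buf += item
  }
}
```

With a limit of 10, adding the values 1 to 11 leaves the buffer holding 2 to 11: the eleventh insert first drops one entry from the front. Dropping a batch rather than a single entry means that, for larger limits, the trim runs only once every limit / 10 inserts.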
Executor state changes
case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) =>
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state
      if (state == ExecutorState.RUNNING) {
        assert(oldState == ExecutorState.LAUNCHING,
          s"executor $execId state transfer from $oldState to RUNNING is illegal")
        appInfo.resetRetryCount()
      }
      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))
      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // If an application has already finished, preserve its
        // state to display its information properly on the UI
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)
        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        // Important note: this code path is not exercised by tests, so be very careful when
        // changing this `if` condition.
        if (!normalExit
            && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
            && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
          val execs = appInfo.executors.values
          if (!execs.exists(_.state == ExecutorState.RUNNING)) {
            logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
              s"${appInfo.retryCount} times; removing it")
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }
      }
      schedule()
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
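The handler first looks up the application by appId, then looks up the executor in that application's executors cache. The result has type Option[ExecutorDesc].
If the executor is found, its state is updated first. When the new state is RUNNING, the previous state must have been LAUNCHING, and the application's retry count is reset. Then the following code runs:
exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))
This line forwards the executor's new state to the application's driver.
Next the handler checks the executor's state. If it is a finished state (ExecutorState.isFinished(state) returns true), the executor is removed from the application's cache, but only when the application itself has not finished, so that a finished application keeps its executor information for display on the UI. In either case the executor is then removed from its worker's cache.
If the executor did not exit normally (its exit status is not 0), the application's retry count is incremented and compared with spark.deploy.maxExecutorRetries (10 by default, configurable; a negative value disables this path). If the count has reached the limit and the application has no executor left in the RUNNING state, the application is removed with state FAILED. Finally, schedule() is called.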
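The retry condition depends on evaluation order, so a minimal sketch may help. AppSketch and shouldRemoveApp are hypothetical, simplified stand-ins and not Spark's ApplicationInfo or Master code. The sketch folds the "no RUNNING executor" check into the same boolean expression and tracks running executors as a plain counter.

```scala
// Hypothetical, simplified model of the retry decision; not Spark code.
object ExecutorRetrySketch {
  // Stand-in for the per-application bookkeeping the Master keeps.
  final class AppSketch(var retryCount: Int = 0, var runningExecutors: Int = 0) {
    def incrementRetryCount(): Int = { retryCount += 1; retryCount }
  }

  // Returns true when the application should be removed as FAILED
  // after one of its executors has finished with `exitStatus`.
  def shouldRemoveApp(app: AppSketch, exitStatus: Option[Int], maxRetries: Int): Boolean = {
    val normalExit = exitStatus.contains(0)
    // && short-circuits: the retry count is only incremented on an abnormal exit.
    !normalExit &&
      app.incrementRetryCount() >= maxRetries &&
      maxRetries >= 0 &&            // a negative limit disables removal
      app.runningExecutors == 0     // keep the app while any executor still runs
  }
}
```

For example, with maxRetries set to 2 and no running executors, the first abnormal exit returns false and the second returns true. A normal exit (Some(0)) never touches the retry count, and a negative maxRetries always yields false even though the count still grows.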