Task run-state management
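While the TaskRunner.run() thread executes, it also manages the task's run state. When the task starts, it produces a StatusUpdate event that marks the task as TaskState.RUNNING; when the task completes, it reports TaskState.FINISHED; if the task hits an error while running, it reports TaskState.FAILED. All of these messages are sent to the Driver through execBackend.statusUpdate(). Let's look at statusUpdate(): it is declared as an abstract method on ExecutorBackend, and the implementation invoked here is the one in CoarseGrainedExecutorBackend: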
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  // Wrap the task state in a StatusUpdate message
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    // Send the StatusUpdate message to the Driver (SparkDeploySchedulerBackend)
    case Some(driverRef) => driverRef.send(msg)
    // No Driver connection yet: the update is dropped with a warning
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
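The reporting sequence described earlier (RUNNING first, then either FINISHED or FAILED) can be sketched without Spark as follows. This is only an illustration under stated assumptions: MiniTaskState, MiniExecutorBackend, RecordingBackend and MiniTaskRunner are hypothetical names that do not exist in Spark, and the backend here records updates in memory instead of sending them to a Driver.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

// Hypothetical, simplified stand-in for Spark's TaskState.
object MiniTaskState extends Enumeration {
  val RUNNING, FINISHED, FAILED = Value
}

// Hypothetical stand-in for the abstract statusUpdate() contract.
trait MiniExecutorBackend {
  def statusUpdate(taskId: Long, state: MiniTaskState.Value): Unit
}

// Records every update in memory instead of sending it to a Driver.
class RecordingBackend extends MiniExecutorBackend {
  val updates = ArrayBuffer.empty[(Long, MiniTaskState.Value)]
  override def statusUpdate(taskId: Long, state: MiniTaskState.Value): Unit =
    updates += ((taskId, state))
}

object MiniTaskRunner {
  // Report RUNNING, run the body, then report one terminal state.
  def run(taskId: Long, backend: MiniExecutorBackend)(body: => Unit): Unit = {
    backend.statusUpdate(taskId, MiniTaskState.RUNNING)
    val terminal =
      try {
        body
        MiniTaskState.FINISHED
      } catch {
        case NonFatal(_) => MiniTaskState.FAILED
      }
    backend.statusUpdate(taskId, terminal)
  }
}
```

Computing the terminal state before reporting it keeps the sketch from ever sending both FINISHED and FAILED for the same task.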
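CoarseGrainedExecutorBackend.statusUpdate() simply wraps the task state in a StatusUpdate message and sends it to the Driver. On the Driver side, the message is handled by the receive() method of CoarseGrainedSchedulerBackend, the parent class of SparkDeploySchedulerBackend: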
override def receive: PartialFunction[Any, Unit] = {
  // Handle a task status update sent by an executor
  case StatusUpdate(executorId, taskId, state, data) =>
    // Delegate to TaskSchedulerImpl.statusUpdate()
    scheduler.statusUpdate(taskId, state, data.value)
    // Once the task has reached a terminal state, release its cores and offer them again
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          makeOffers(executorId)
        case None =>
          // Ignoring the update since we don't know about the executor.
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }
}
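The core accounting performed in receive() can be illustrated with a small standalone sketch. It rests on assumptions: CoreAccounting, ExecutorSlot and onStatusUpdate are hypothetical names, states are plain strings rather than TaskState values, and the real code additionally calls makeOffers() to schedule new tasks on the freed cores.

```scala
import scala.collection.mutable

object CoreAccounting {
  // Assumed value; Spark reads it from the spark.task.cpus setting.
  val CpusPerTask = 1

  // Hypothetical per-executor record holding only the free core count.
  final class ExecutorSlot(var freeCores: Int)

  // States after which a task no longer occupies its cores.
  private val terminalStates = Set("FINISHED", "FAILED", "KILLED", "LOST")

  // Returns true when cores were released for a known executor.
  def onStatusUpdate(
      executors: mutable.Map[String, ExecutorSlot],
      executorId: String,
      state: String): Boolean = {
    if (!terminalStates.contains(state)) {
      false // e.g. RUNNING: the task still holds its cores
    } else {
      executors.get(executorId) match {
        case Some(slot) =>
          slot.freeCores += CpusPerTask
          true
        case None =>
          false // unknown executor: the update is ignored
      }
    }
  }
}
```

Note that cores are released for every terminal state, not only for successful completion, because a failed or killed task has also stopped using them.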
The receive() method in turn calls TaskSchedulerImpl.statusUpdate(). Here is a brief look at that method:
def statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) {
  var failedExecutor: Option[String] = None
  synchronized {
    try {
      // In real deployments a task can be lost for many reasons; this branch handles that case.
      // Check whether the task is LOST and is one that this scheduler launched earlier
      if (state == TaskState.LOST && taskIdToExecutorId.contains(tid)) {
        // We lost this entire executor, so remember that it's gone
        // Look up the ID of the executor that was running the task
        val execId = taskIdToExecutorId(tid)
        if (executorIdToTaskCount.contains(execId)) {
          // Remove the executor
          removeExecutor(execId,
            SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
          failedExecutor = Some(execId)
        }
      }
      // Look up the TaskSetManager that owns this task
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
          // The task has reached a terminal state (FINISHED, FAILED, KILLED or LOST)
          if (TaskState.isFinished(state)) {
            // Remove the task's bookkeeping entries so it is no longer tracked
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          // The task finished successfully
          if (state == TaskState.FINISHED) {
            // Remove it from the set of running tasks
            taskSet.removeRunningTask(tid)
            // Hand the serialized result data to the result getter
            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
          } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
            // The task ended abnormally: remove it and queue the failure for handling
            taskSet.removeRunningTask(tid)
            taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
          }
        case None =>
          logError(
            ("Ignoring update with state %s for TID %s because its task set is gone (this is " +
              "likely the result of receiving duplicate task finished status updates)")
              .format(state, tid))
      }
    } catch {
      case e: Exception => logError("Exception in statusUpdate", e)
    }
  }
  // Update the DAGScheduler without holding a lock on this, since that can deadlock
  if (failedExecutor.isDefined) {
    dagScheduler.executorLost(failedExecutor.get)
    backend.reviveOffers()
  }
}
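TaskSchedulerImpl.statusUpdate() handles the states a task can report: a lost task, a task that finished successfully, a task that ended abnormally, and so on.
To summarize, this code path deals with the states a task produces while it runs. There are mainly five. Nothing is done for RUNNING, while FAILED, LOST, KILLED and FINISHED are each processed. For FINISHED, the serialized result data is handed to taskResultGetter.enqueueSuccessfulTask() so that it can be fetched and saved. For the other three, the task's bookkeeping entries are removed and the failure is passed to taskResultGetter.enqueueFailedTask(), after which the task may be resubmitted. In addition, a LOST task causes its executor to be removed, and whenever a task reaches a terminal state the Driver frees the corresponding cores and makes new resource offers.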
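The per-state handling summarized above can be condensed into a small decision table. This is a sketch under assumptions rather than Spark's code: StatusDispatch, its State and Action types, and actionFor are hypothetical names introduced only for illustration, and the actions are labels instead of real calls into taskResultGetter.

```scala
object StatusDispatch {
  // Hypothetical mirror of the five states discussed in the text.
  sealed trait State
  case object Running extends State
  case object Finished extends State
  case object Failed extends State
  case object Killed extends State
  case object Lost extends State

  // What the scheduler does with an update, reduced to a label.
  sealed trait Action
  case object NoOp extends Action
  case object EnqueueSuccess extends Action
  case object EnqueueFailure extends Action

  // Every state except Running ends the task's life on the executor.
  def isTerminal(state: State): Boolean = state != Running

  // Only a LOST task also marks its executor as lost.
  def marksExecutorLost(state: State): Boolean = state == Lost

  def actionFor(state: State): Action = state match {
    case Running                => NoOp
    case Finished               => EnqueueSuccess
    case Failed | Killed | Lost => EnqueueFailure
  }
}
```

Modeling the states as a sealed trait lets the compiler warn if a state is left out of the match, which is one way to keep such a dispatch exhaustive.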