Learning Spark 2.4.0, Source Code Analysis 2, Spark Core: Principles and Source Analysis of the Master's State-Change Handling

Latest recommended article published on 2020-01-20 16:10:10

pre_tender


Article link: https://blog.csdn.net/pre_tender/article/details/100192159


Table of contents

1. What the Master's state-change handling covers
2. State-change handling mechanism and source analysis
Acknowledgements

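1. What the Master's state-change handling covers

A Master is either the Active Master or a Standby Master, and Drivers, Workers, and Applications all have to register with the Master. The Master's state-change handling therefore mainly covers the following:

Handling Master state changes
Handling Driver state changes
Handling Worker state changes
Handling Application state changes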

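2. State-change handling mechanism and source analysis

2.1 Handling Master state changes

An Active Master is chosen either through ZooKeeper's election mechanism or through the file-system-based manual switchover mechanism, and several Standby Masters may exist alongside it. Whenever a new Active Master is chosen, both the old Active Master and the promoted Standby Master have to change state. This section briefly describes how each side is handled; for the detailed code walkthrough, see the article on the Master active/standby switchover mechanism.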

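2.1.1 The Active Master is no longer active

An Active Master can stop being active in two situations:

With ZooKeeper, the current Active Master has most likely run into trouble (for example, it has been unreachable for a long time). A new leader is then elected, and the old Active Master may simply exit its process.
With the file-system-based manual switchover, an administrator has decided to swap the active and standby Masters for some reason. In this case the old Active Master will most likely become a Standby Master.
For details, see "Master active/standby switchover mechanism: the Active Master is no longer active".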

2.1.2 A Standby Master becomes the Active Master

For details, see "Master active/standby switchover mechanism: a Standby Master becomes the Active Master".
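
To make the two directions of a leadership change concrete, here is a minimal Scala sketch of how a Master might react to leadership messages. It is a toy model, not the Spark source. The message names ElectedLeader and RevokedLeadership and the states Standby, Recovering, and Alive follow what I believe Spark's Master uses; the hasPersistedData flag and the ShutDown value are assumptions added purely for illustration. It models only the path in which a Master that loses leadership exits.

```scala
// Toy model of the Master's leadership transitions; not the real Spark classes.
object MasterLeadershipSketch {
  sealed trait RecoveryState
  case object Standby extends RecoveryState
  case object Recovering extends RecoveryState
  case object Alive extends RecoveryState
  case object ShutDown extends RecoveryState // assumed stand-in for exiting the process

  sealed trait LeaderMessage
  // hasPersistedData (assumed): whether the previous leader left apps, drivers
  // or workers in the persistence engine that need to be recovered.
  final case class ElectedLeader(hasPersistedData: Boolean) extends LeaderMessage
  case object RevokedLeadership extends LeaderMessage

  // Every Master starts out as a standby until it is elected.
  val initial: RecoveryState = Standby

  def handle(msg: LeaderMessage): RecoveryState = msg match {
    case ElectedLeader(true)  => Recovering // must recover the stored state first
    case ElectedLeader(false) => Alive      // nothing to recover, serve immediately
    case RevokedLeadership    => ShutDown   // the old leader gives up
  }
}
```

A newly elected Master with nothing to recover can become Alive at once, while one that finds persisted state passes through Recovering first.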

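2.2 Handling Driver state changes

The Master receives the message DriverStateChanged(driverId, state, exception). Possible senders of this message: to be continued.
For a Driver whose state is ERROR, FINISHED, KILLED, or FAILED, the Master always calls removeDriver() to remove its information.
removeDriver() looks the Driver up by driverId and then:

Removes the Driver from the drivers list
Adds it to the completedDrivers list
Removes it from the persistence engine
Removes the Driver from its Worker's record, releasing the resources it held
Calls schedule() so that the next waiting work can be scheduled.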
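
The steps above can be sketched as a small, self-contained Scala model. This is only an illustration of the bookkeeping order, under the assumption that each structure behaves like a plain collection. The Worker, Driver, and Master classes, the persisted set, and the scheduleCalls counter are hypothetical stand-ins, not Spark's DriverInfo, WorkerInfo, or PersistenceEngine.

```scala
import scala.collection.mutable

// Toy model of the bookkeeping done when a driver is removed.
// All types are simplified stand-ins (assumptions), not Spark classes.
object RemoveDriverSketch {
  final class Worker(val id: String) {
    val drivers = mutable.Set.empty[String] // ids of drivers running on this worker
  }

  final class Driver(val id: String, val worker: Option[Worker]) {
    var state: String = "RUNNING"
  }

  final class Master {
    val drivers = mutable.ArrayBuffer.empty[Driver]
    val completedDrivers = mutable.ArrayBuffer.empty[Driver]
    val persisted = mutable.Set.empty[String] // stands in for the persistence engine
    var scheduleCalls = 0

    def schedule(): Unit = scheduleCalls += 1 // placeholder for the real scheduler

    // Returns true if the driver was known and has been removed.
    def removeDriver(driverId: String, finalState: String): Boolean =
      drivers.find(_.id == driverId) match {
        case Some(driver) =>
          drivers -= driver                             // 1. leave the active list
          completedDrivers += driver                    // 2. remember it as completed
          persisted -= driver.id                        // 3. drop it from persistence
          driver.state = finalState
          driver.worker.foreach(_.drivers -= driver.id) // 4. free it on its worker
          schedule()                                    // 5. schedule waiting work
          true
        case None =>
          false                                         // unknown driver: nothing to do
      }
  }
}
```

Calling removeDriver for an unknown id is a no-op that returns false, which mirrors the idea that a state update for a Driver the Master does not know about should not disturb anything else.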


**The code is as follows:**
---------------org.apache.spark.deploy.master.Master.scala----------------------
override def receive: PartialFunction[Any, Unit] = {
  // ...
  case DriverStateChanged(driverId, state, exception) =>
    state match {
      // Terminal states: remove the Driver's information from the Master
      case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
        removeDriver(driverId, state, exception)
      // Any other state is not expected in this message
      case _ =>
        throw new Exception(s"Received unexpected state update for driver $driverId: $state")
    }
  // ...
}
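
The dispatch on terminal states can be tried out in isolation with the following sketch. It assumes a local DriverState enumeration whose value list is my reconstruction rather than something taken from the excerpt above, and the handle helper is hypothetical: it returns a description instead of touching any Master state.

```scala
// Toy model of the dispatch on driver states; DriverState is a local stand-in.
object DriverStateSketch {
  // Assumed value list: the four terminal states from the match above
  // plus some non-terminal ones.
  object DriverState extends Enumeration {
    val SUBMITTED, RUNNING, FINISHED, RELAUNCHING, UNKNOWN, KILLED, FAILED, ERROR = Value
  }

  // Hypothetical helper: describes the action instead of mutating Master state.
  def handle(driverId: String, state: DriverState.Value): String = state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      s"removed $driverId ($state)"
    case _ =>
      // Non-terminal states are not expected to arrive through this message.
      throw new IllegalStateException(
        s"Received unexpected state update for driver $driverId: $state")
  }
}
```

The alternative pattern (A | B | C) keeps all terminal states in one case, so adding a new terminal state only means extending that one line.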

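2.3 Handling Executor state changes

The Master first uses the appId sent by the Executor to look up the App, then fetches the ExecutorDescription from the App's internal executors cache.
It then pattern-matches on execOption. If it holds a value (Some(exec)), the following logic runs:
- Update the Executor's state and notify the Driver
- For an exec in the KILLED, FAILED, LOST, or EXITED state, remove the executor's information from both the worker and the app
- Decide, from the executor's exitStatus and the retry count, whether to remove the Application's information from the Master
- Call schedule() to reschedule.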
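
The two-level lookup in the first step is a chain of Option operations. The sketch below shows it with immutable maps and hypothetical ExecutorDesc and App case classes, assumed here only to keep the example self-contained.

```scala
// Toy model of the two-level lookup: appId -> App, then execId -> executor.
// ExecutorDesc and App are simplified, hypothetical stand-ins.
object ExecutorLookupSketch {
  final case class ExecutorDesc(id: Int, state: String)
  final case class App(id: String, executors: Map[Int, ExecutorDesc])

  def findExecutor(
      idToApp: Map[String, App],
      appId: String,
      execId: Int): Option[ExecutorDesc] =
    // get returns None for an unknown app; flatMap then skips the inner lookup,
    // so an unknown app and an unknown executor both end up as None.
    idToApp.get(appId).flatMap(app => app.executors.get(execId))
}
```

Because both failure cases collapse into None, the caller needs only a single Some/None match, which is exactly the shape of the handler in this section.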

    case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
      // 1. The data is stored in HashMaps: (appId -> App) and (execId -> Exec).
      //    So first use the appId reported by the Executor to get the App, then get
      //    the ExecutorDescription from the App's internal executors cache.
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        // 2. A matching Executor was found
        case Some(exec) =>
          // 2.1 Remember the old state and store the new state on the Executor
          val appInfo = idToApp(appId)
          val oldState = exec.state
          exec.state = state
          // 2.2 If the new state is RUNNING, the old state must have been LAUNCHING;
          //     any other transition into RUNNING is illegal.
          if (state == ExecutorState.RUNNING) {
            assert(oldState == ExecutorState.LAUNCHING,
              s"executor $execId state transfer from $oldState to RUNNING is illegal")
            appInfo.resetRetryCount() // the Executor started successfully, so reset the retry count
          }
          // 2.3 Send the Executor's current state to the Driver
          exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))

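          // 2.4 If the Executor is in a finished state: KILLED, FAILED, LOST, EXITED
          if (ExecutorState.isFinished(state)) {
            // Remove this Executor from the Worker and the App
            logInfo(s"Removing executor ${exec.fullId} because it is $state")
            // If the Application has already finished, keep its state so that its
            // information still displays properly on the UI; otherwise remove the
            // Executor from the App's cache
            if (!appInfo.isFinished) {
              appInfo.removeExecutor(exec)
            }

            // Remove the Executor from the cache of the Worker that ran it
            exec.worker.removeExecutor(exec)

            // The rest decides whether to remove the Application's information
            // from the Master. Removing it also notifies the Driver.
            val normalExit = exitStatus == Some(0)
            // Only retry a limited number of times so that we do not loop forever.
            // If the exit was abnormal, the retry count has reached MAX_EXECUTOR_RETRIES (10),
            // and no Executor of the Application is still RUNNING, call removeApplication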
            if (!normalExit
                && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
                && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
              val execs = appInfo.executors.values
              if (!execs.exists(_.state == ExecutorState.RUNNING)) {
                logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                  s"${appInfo.retryCount} times; removing it")
                removeApplication(appInfo, ApplicationState.FAILED)
              }
            }
          }
          // 2.5 Reschedule
          schedule()
        case None =>
          logWarning(s"Got status update for unknown executor $appId/$execId")
      }
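
The retry decision at the end of the handler is easy to get wrong, so here is a minimal sketch of just that condition. It assumes a simple counter-based AppInfo. The names MaxExecutorRetries and shouldRemoveApp and the maxRetries parameter are hypothetical, and 10 is the limit mentioned in the comments above.

```scala
// Toy model of the "give up after too many executor failures" decision.
// AppInfo and shouldRemoveApp are hypothetical stand-ins, not Spark classes.
object ExecutorRetrySketch {
  val MaxExecutorRetries = 10 // assumed limit, matching the value noted in the comments above

  final class AppInfo {
    private var retries = 0
    def retryCount: Int = retries
    def incrementRetryCount(): Int = { retries += 1; retries }
    def resetRetryCount(): Unit = retries = 0
  }

  // Returns true when the application should be removed as FAILED.
  def shouldRemoveApp(
      app: AppInfo,
      exitStatus: Option[Int],
      anyExecutorRunning: Boolean,
      maxRetries: Int = MaxExecutorRetries): Boolean = {
    val normalExit = exitStatus.contains(0)
    // && short-circuits: the counter only increases for abnormal exits,
    // and a negative maxRetries disables the removal path entirely.
    !normalExit &&
      app.incrementRetryCount() >= maxRetries &&
      maxRetries >= 0 &&
      !anyExecutorRunning
  }
}
```

Note that a successful transition to RUNNING resets the counter in the handler above, so only failures that are not interrupted by a successful start add up toward the limit.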