Spark Source Code Series (2): Master

This installment is organized into two parts:

1. Master active/standby switchover, registration, and state changes

2. Resource scheduling

-------------------------------

Let's start with the Master's active/standby switchover.

Open Master.scala and look at the onStart method:

val serializer = new JavaSerializer(conf)
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
  case "ZOOKEEPER" =>
    logInfo("Persisting recovery state to ZooKeeper")
    val zkFactory =
      new ZooKeeperRecoveryModeFactory(conf, serializer)
    (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))

Depending on RECOVERY_MODE, this creates a ZooKeeperRecoveryModeFactory. The factory provides two methods: createPersistenceEngine, which builds the persistence engine, and createLeaderElectionAgent, which sets up leader election.
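The RECOVERY_MODE value above comes from the spark.deploy.recoveryMode setting. For completeness, here is a minimal sketch of the configuration that enables ZooKeeper-based HA; it is normally placed in spark-defaults.conf or SPARK_MASTER_OPTS rather than set in code, and the ZooKeeper hosts below are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.deploy.recoveryMode", "ZOOKEEPER")          // selects the ZOOKEEPER branch above
  .set("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181") // ZooKeeper ensemble (placeholder hosts)
  .set("spark.deploy.zookeeper.dir", "/spark")            // base znode for recovery state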

Stepping into ZooKeeperPersistenceEngine, you can see its core methods:

override def persist(name: String, obj: Object): Unit = {
  serializeIntoFile(WORKING_DIR + "/" + name, obj)
}

override def unpersist(name: String): Unit = {
  zk.delete().forPath(WORKING_DIR + "/" + name)
}

override def read[T: ClassTag](prefix: String): Seq[T] = {
  zk.getChildren.forPath(WORKING_DIR).asScala
    .filter(_.startsWith(prefix)).map(deserializeFromFile[T]).flatten
}

In essence these are just thin wrappers around the ZooKeeper client (Curator): persist writes a znode, unpersist deletes it, and read lists and deserializes the children under the working directory.
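The two helpers these methods rely on can be sketched as follows (reconstructed from memory rather than copied from the source, so details may differ between Spark versions): the object is serialized with the configured serializer and written to a persistent znode through Curator, then read back with getData().

private def serializeIntoFile(path: String, value: AnyRef): Unit = {
  // Serialize the object (JavaSerializer by default) and store the bytes in a persistent znode.
  val serialized = serializer.newInstance().serialize(value)
  val bytes = new Array[Byte](serialized.remaining())
  serialized.get(bytes)
  zk.create().withMode(CreateMode.PERSISTENT).forPath(path, bytes)
}

private def deserializeFromFile[T](filename: String)(implicit m: ClassTag[T]): Option[T] = {
  // Read the znode's bytes back and deserialize; unreadable entries are skipped.
  val bytes = zk.getData().forPath(WORKING_DIR + "/" + filename)
  try {
    Some(serializer.newInstance().deserialize[T](ByteBuffer.wrap(bytes)))
  } catch {
    case _: Exception => None
  }
}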

Now focus on the receive method in Master.scala, in particular the handling of ElectedLeader:

override def receive: PartialFunction[Any, Unit] = {
  case ElectedLeader => {
    val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
    state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
      RecoveryState.ALIVE
    } else {
      RecoveryState.RECOVERING
    }
    logInfo("I have been elected leader! New state: " + state)
    if (state == RecoveryState.RECOVERING) {
      beginRecovery(storedApps, storedDrivers, storedWorkers)
      recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(CompleteRecovery)
        }
      }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
    }
  }
val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)

First it reads the persisted app, driver, and worker information back through the persistence engine.
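For reference, readPersistedData in PersistenceEngine roughly looks like the following (reconstructed from memory, so the key prefixes and the rpcEnv.deserialize wrapper may differ slightly by version): it simply reads back everything stored under the app_, driver_, and worker_ prefixes.

final def readPersistedData(
    rpcEnv: RpcEnv): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo]) = {
  rpcEnv.deserialize { () =>
    (read[ApplicationInfo]("app_"), read[DriverInfo]("driver_"), read[WorkerInfo]("worker_"))
  }
}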

If all three are empty, the new state is ALIVE; otherwise it is RECOVERING.

In the RECOVERING case, beginRecovery is then called:

private def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
    storedWorkers: Seq[WorkerInfo]) {
  for (app <- storedApps) {
    logInfo("Trying to recover app: " + app.id)
    try {
      registerApplication(app)
      app.state = ApplicationState.UNKNOWN
      app.driver.send(MasterChanged(self, masterWebUiUrl))
    } catch {
      case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
    }
  }

  for (driver <- storedDrivers) {
    // Here we just read in the list of drivers. Any drivers associated with now-lost workers
    // will be re-launched when we detect that the worker is missing.
    drivers += driver
  }

  for (worker <- storedWorkers) {
    logInfo("Trying to recover worker: " + worker.id)
    try {
      registerWorker(worker)
      worker.state = WorkerState.UNKNOWN
      worker.endpoint.send(MasterChanged(self, masterWebUiUrl))
    } catch {
      case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
    }
  }
}

This method walks through the stored apps, drivers, and workers in turn and re-registers them.

Take the workers as an example: registerWorker is called.

private def registerWorker(worker: WorkerInfo): Boolean = {
  // There may be one or more refs to dead workers on this same node (w/ different ID's),
  // remove them.
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }

  val workerAddress = worker.endpoint.address
  if (addressToWorker.contains(workerAddress)) {
    val oldWorker = addressToWorker(workerAddress)
    if (oldWorker.state == WorkerState.UNKNOWN) {
      // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
      // The old worker must thus be dead, so we will remove it and accept the new worker.
      removeWorker(oldWorker)
    } else {
      logInfo("Attempted to re-register worker at same address: " + workerAddress)
      return false
    }
  }

  workers += worker
  idToWorker(worker.id) = worker
  addressToWorker(workerAddress) = worker
  true
}

The method first drops any dead workers previously registered on the same host and port, then adds the new worker to the Master's bookkeeping collections (workers, idToWorker, and addressToWorker).

Back in beginRecovery, the worker's state is then set to UNKNOWN (worker.state = WorkerState.UNKNOWN) and a MasterChanged message is sent to it.

Now switch to Worker.scala:

case MasterChanged(masterRef, masterWebUiUrl) =>
  logInfo("Master has changed, new master is at " + masterRef.address.toSparkURL)
  changeMaster(masterRef, masterWebUiUrl)

  val execs = executors.values.
    map(e => new ExecutorDescription(e.appId, e.execId, e.cores, e.state))
  masterRef.send(WorkerSchedulerStateResponse(workerId, execs.toList, drivers.keys.toSeq))

On MasterChanged the worker switches over to the new master, collects descriptions of its current executors, and replies to the Master with a WorkerSchedulerStateResponse.

Back in the Master:

case Some(worker) =>
  logInfo("Worker has been re-registered: " + workerId)
  worker.state = WorkerState.ALIVE

  val validExecutors = executors.filter(exec => idToApp.get(exec.appId).isDefined)
  for (exec <- validExecutors) {
    val app = idToApp.get(exec.appId).get
    val execInfo = app.addExecutor(worker, exec.cores, Some(exec.execId))
    worker.addExecutor(execInfo)
    execInfo.copyState(exec)
  }

On receiving that response, the Master marks the worker as ALIVE again and rebuilds the executor bookkeeping for every application it still knows about.

After beginRecovery, a task is scheduled on forwardMessageThread that sends a CompleteRecovery message to the Master itself once WORKER_TIMEOUT_MS has elapsed:

recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
  override def run(): Unit = Utils.tryLogNonFatalError {
    self.send(CompleteRecovery)
  }
}, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)

Next, look at completeRecovery():
private def completeRecovery() {
  // Ensure "only-once" recovery semantics using a short synchronization period.
  if (state != RecoveryState.RECOVERING) { return }
  state = RecoveryState.COMPLETING_RECOVERY

  // Kill off any workers and apps that didn't respond to us.
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

  // Reschedule drivers which were not claimed by any workers
  drivers.filter(_.worker.isEmpty).foreach { d =>
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d)
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }

  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}

It first removes the workers and applications that never responded and are therefore still in the UNKNOWN state.

For drivers that were not claimed by any worker, it relaunches them via relaunchDriver if they were submitted with supervision, and removes them otherwise.

Finally it calls schedule(), which was touched on in part 1; it is discussed in detail below.

-----------------

The walkthrough above has already covered the code for registering workers, applications, and drivers on the Master.

---------------

Both the analysis above and the one in part 1 keep running into one crucial method, schedule(). Let's look at it next.

private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) { return }
  // Drivers take strict precedence over executors
  val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
  for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
    for (driver <- waitingDrivers) {
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
      }
    }
  }
  startExecutorsOnWorkers()
}

It first calls Random.shuffle(workers), which shuffles all of the workers registered with this Master so that drivers end up balanced across them.

It then loops over every worker in the ALIVE state and schedules drivers onto them. Note that this only applies to standalone cluster deploy mode (spark-submit --deploy-mode cluster against a standalone master), where the driver itself runs on a worker; in client mode the Master has no drivers to place.

For each worker with enough free memory and cores, every driver waiting in the waitingDrivers queue is launched there:

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}
launchDriver makes the worker and the driver reference each other, sends a LaunchDriver message to the worker (this will come up again in the next installment on the Worker), and updates the driver's state to RUNNING.

startExecutorsOnWorkers() is the method that schedules applications:
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}

It iterates over the applications in the waitingApps queue, guarded by the condition that the application still has cores left to allocate (app.coresLeft > 0).

val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
    worker.coresFree >= coresPerExecutor.getOrElse(1))
  .sortBy(_.coresFree).reverse

This expression filters out the usable workers: ALIVE workers with enough free memory for an executor and at least coresPerExecutor (default 1) free cores, sorted by free cores in descending order.

val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

scheduleExecutorsOnWorkers then decides how many cores to assign on each of those workers for this application.
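scheduleExecutorsOnWorkers itself is fairly involved, but the default behaviour (controlled by spark.deploy.spreadOut, true by default) is to spread an application's cores across as many workers as possible. Below is a simplified sketch of that round-robin idea, not the actual Spark method:

// Hand out cores one at a time, cycling over the usable workers, until either
// the app's remaining demand or the workers' free cores run out.
// coresFree(i) is the number of free cores on worker i.
def spreadOutCores(coresNeeded: Int, coresFree: Array[Int]): Array[Int] = {
  val assigned = Array.fill(coresFree.length)(0)
  var toAssign = math.min(coresNeeded, coresFree.sum)
  var pos = 0
  while (toAssign > 0) {
    if (coresFree(pos) - assigned(pos) > 0) {  // this worker still has a free core
      assigned(pos) += 1
      toAssign -= 1
    }
    pos = (pos + 1) % coresFree.length         // move on to the next worker
  }
  assigned
}

With spark.deploy.spreadOut set to false, the allocation instead fills up one worker before moving on to the next.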

for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
  allocateWorkerResourceToExecutors(
    app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
}

This loop hands the assigned cores on each worker over to executors, guarded by the condition that the worker was actually assigned cores (assignedCores(pos) > 0).

allocateWorkerResourceToExecutors is the method that launches executors on a single worker.
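Before the loop shown below, the method works out how many executors to start on this worker and how many cores each one gets. Roughly (a sketch using the same variable names that appear in the loop, not the verbatim Spark source):

// If the app fixed the cores per executor, pack as many executors of that size
// as the assigned cores allow; otherwise launch one executor that takes every
// core assigned on this worker.
val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
val coresToAssign = coresPerExecutor.getOrElse(assignedCores)

The loop then creates and launches each executor: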

for (i <- 1 to numExecutors) {
  val exec = app.addExecutor(worker, coresToAssign)
  launchExecutor(worker, exec)
  app.state = ApplicationState.RUNNING
}

For each executor, app.addExecutor links the executor to the application, and then launchExecutor is called:
worker.addExecutor(exec)
worker.endpoint.send(LaunchExecutor(masterUrl,
  exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
exec.application.driver.send(
  ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))

Internally it does three things: (1) the worker and the executor are made to reference each other, (2) a LaunchExecutor message is sent to the worker, and (3) an ExecutorAdded message is sent to the driver.

 

Back in allocateWorkerResourceToExecutors, the application's state is finally set to RUNNING.
