Whenever the Master receives a RegisterWorker / RegisterApplication / ExecutorStateChanged / RequestSubmitDriver message, or finishes a master failover, it runs schedule() to hand the currently available resources (drivers, and executors on workers) to the waiting apps. In other words, it is called every time a new app arrives or resource availability changes.
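A minimal sketch of those trigger points in Master's message loop (handler bodies abbreviated and simplified; the real handlers do registration and persistence work first):
// Abbreviated: every relevant handler ends by calling schedule()
override def receive: PartialFunction[Any, Unit] = {
  case RegisterApplication(description, driver) =>
    // ... create and persist the ApplicationInfo, reply RegisteredApplication ...
    schedule()
  // The RegisterWorker / ExecutorStateChanged / RequestSubmitDriver handlers
  // finish the same way, as does completeRecovery() after a failover.
}
schedule() itself: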
private def schedule(): Unit = {
  // A standby Master never takes part in scheduling application resources
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // `workers` holds every worker registered via registerWorker(); its type is HashSet[WorkerInfo].
  // Shuffle the alive workers into a random order first.
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    // Visit every worker in ALIVE state
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      // If this worker's free memory and free CPU cores can satisfy the driver,
      // launch it here and remove it from waitingDrivers
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // With drivers launched, start the executors on the workers
  startExecutorsOnWorkers()
}
In client deploy mode (standalone client mode, and likewise yarn-client) the driver is launched locally and never registered with this Master; only drivers submitted in standalone cluster mode (via RequestSubmitDriver) land in waitingDrivers and need to be scheduled here.
Launching drivers takes strict precedence over launching executors.
The driver launch procedure (launchDriver, sketched below):
- 1. Add the driver to the chosen worker and record the worker on the driver, so the two in-memory structures reference each other;
- 2. Send a LaunchDriver message to the worker's endpoint, so the Worker actually starts the driver;
- 3. Set the driver's state to RUNNING.
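A sketch of launchDriver in Master.scala, reconstructed from the three steps above (Spark 2.4; minor details such as log wording may differ):
private def launchDriver(worker: WorkerInfo, driver: DriverInfo): Unit = {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // Step 1: cross-reference the driver and the worker in the Master's memory
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // Step 2: ask the Worker to start the driver process
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // Step 3: mark the driver as running
  driver.state = DriverState.RUNNING
}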
On receiving LaunchDriver, the Worker news up a DriverRunner, starts it, and adds the driver's CPU and memory to its running totals.
DriverRunner lives in org.apache.spark.deploy.worker. Its start() actually news a Thread and starts it:
it first registers kill() as a shutdown hook, so the driver process is killed if the worker shuts down;
it then prepares the driver's jars and runs the driver;
once the process exits, it sends a DriverStateChanged message reporting the driver's final state.
To run the driver, it builds a ProcessBuilder via CommandUtils.buildProcessBuilder(); this builder is what actually gets executed:
val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
  driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
It then redirects the process's stdout/stderr to log files and finally calls runCommandWithRetry().
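Condensed, the thread body of DriverRunner.start() is roughly the following (error handling and final-state bookkeeping elided; see the real class for the full flow):
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      // Register kill() as a shutdown hook so the driver process dies with the worker
      val shutdownHook = ShutdownHookManager.addShutdownHook { () => kill() }
      // Download the user jar, build the ProcessBuilder shown above,
      // redirect stdout/stderr, then runCommandWithRetry() until the process exits
      val exitCode = prepareAndRunDriver()
      // ... derive finalState / finalException from exitCode and kill status ...
      // Report the final state back to the Worker endpoint
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}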
startExecutorsOnWorkers()
Schedules and launches the executors on the workers.
In Spark 2.4, application scheduling lives in startExecutorsOnWorkers(), which uses a simple FIFO algorithm:
- Iterate over every app in waitingApps.
- If the cores the app still needs (coresLeft) are fewer than one executor's configured cores, skip it: no new executor cores are allocated for its coresLeft.
- Filter out the workers that still have schedulable CPU and memory, sort them by free cores in descending order; these become usableWorkers.
- Compute how many cores to assign on each usable worker (the scheduleExecutorsOnWorkers function).
- Walk the usable workers, allocate the resources, and launch the executors (the allocateWorkerResourceToExecutors function).
In this version, scheduleExecutorsOnWorkers() and allocateWorkerResourceToExecutors() read as if their names were swapped: schedule…() only computes the resource assignment, while allocate…() does the actual scheduling and launching.
app.coresLeft is defined in ApplicationInfo.scala; for example, an app that requested 12 cores and has been granted 8 has coresLeft = 4:
private[master] def coresLeft: Int = requestedCores - coresGranted
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler.
  for (app <- waitingApps) {
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // If the cores left is less than the coresPerExecutor, the cores left will not be allocated
    if (app.coresLeft >= coresPerExecutor) {
      // Filter out workers that don't have enough resources to launch an executor
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor)
        .sortBy(_.coresFree).reverse
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
      // Now that we've decided how many cores to allocate on each worker, let's allocate them
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
      }
    }
  }
}
scheduleExecutorsOnWorkers()
Computes how executors should be scheduled onto the workers.
scheduleExecutorsOnWorkers() returns the assignedCores array, which records how many CPU cores to assign on each worker.
The nested function canLaunchExecutor() decides whether the current worker can launch one more executor, i.e. whether new CPU resources may be assigned there.
The assignedExecutors array records how many new executors will be started on each worker; when coresPerExecutor is not specified (the default), only one executor is started per worker.
- If each worker may host only one executor for this app, cores are assigned to that executor one at a time.
- With spreadOutApps (the default), the assignment tries to use as many of the cluster's workers as possible: each round adds minCoresPerExecutor cores on a worker, then moves on to the next.
- Without spreadOutApps, cores keep being added on the current worker, draining its freeCores, so as many cores as possible are taken from that machine before moving on.
Reference: https://blog.csdn.net/snail_gesture/article/details/50808239
private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  val coresPerExecutor = app.desc.coresPerExecutor
  // One core per executor by default (when spark.executor.cores is unset)
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
  val oneExecutorPerWorker = coresPerExecutor.isEmpty
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  val numUsable = usableWorkers.length
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  // Total cores still to assign: min(cores the app needs, sum of free cores on all usable
  // workers). If the workers cannot fully satisfy the app, assign what is available for now.
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

  /** Return whether the specified worker can launch an executor for this app. */
  def canLaunchExecutor(pos: Int): Boolean = {
    val keepScheduling = coresToAssign >= minCoresPerExecutor
    val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor
    // If multiple executors per worker are allowed (coresPerExecutor is specified), we can
    // always launch a new executor. Otherwise, if this worker already has its executor,
    // we only give that executor more cores.
    val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
    if (launchingNewExecutor) {
      // assignedMemory = executors already planned on this worker * memory per executor
      val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
      // Free memory minus memory already earmarked must still cover one executor,
      // i.e. there is enough memory to bring up at least one more executor
      val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
      val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
      keepScheduling && enoughCores && enoughMemory && underLimit
    } else {
      // We're adding cores to an existing executor, so no need
      // to check memory and executor limits
      keepScheduling && enoughCores
    }
  }

  // Keep launching executors until no more workers can accommodate any
  // more executors, or if we have reached this application's limits
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      // In the default spreadOut mode, keepScheduling is reset to false after one pass,
      // so each position receives minCoresPerExecutor cores per round.
      while (keepScheduling && canLaunchExecutor(pos)) {
        // Each iteration gives this worker minCoresPerExecutor more cores:
        // 1 when spark.executor.cores is unset, otherwise the configured
        // coresPerExecutor at a time.
        coresToAssign -= minCoresPerExecutor
        assignedCores(pos) += minCoresPerExecutor
        // If we launch only one executor per worker (spark.executor.cores unset),
        // assignedExecutors stays at 1; otherwise every iteration plans one more
        // executor on this worker.
        if (oneExecutorPerWorker) {
          assignedExecutors(pos) = 1
        } else {
          assignedExecutors(pos) += 1
        }
        // Spreading out an application means spreading out its executors across as
        // many workers as possible. If we are not spreading out, then we should keep
        // scheduling executors on this worker until we use all of its resources.
        // Otherwise, just move on to the next worker.
        if (spreadOutApps) {
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor)
  }
  assignedCores
}
canLaunchExecutor() decides whether to assign CPU for a new or existing executor mainly from keepScheduling and enoughCores.
keepScheduling is determined jointly by coresToAssign (the cores the app has not yet been assigned) and minCoresPerExecutor.
spreadOut
scheduleExecutorsOnWorkers takes the spreadOutApps parameter when computing the assignment. Driven by the spark.deploy.spreadOut config, it decides whether apps are scheduled, i.e. compute resources are assigned, in the spread-out style.
spreadOut defaults to true: each app's executors are spread round-robin across all available nodes, instead of filling one node and only then spilling onto the next.
When spreadOutApps is true, the cursor moves to the next worker after every minCoresPerExecutor cores assigned; otherwise assignedCores keeps accumulating on the current worker as long as it qualifies, and the cursor moves only once that worker runs out of resources. The toy simulation below illustrates the difference.
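A self-contained toy simulation of the two strategies (everything here is local to the sketch, not Spark API): three workers with 8 free cores each, an app that still needs 12 cores, and minCoresPerExecutor = 2:
object SpreadOutDemo {
  // Each round visits every worker that can still take `step` cores; with
  // spreadOut we hop to the next worker after one step, without it we drain
  // the current worker first.
  def assign(freeCores: Array[Int], coresNeeded: Int, step: Int,
             spreadOut: Boolean): Array[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var left = coresNeeded
    def canTake(pos: Int): Boolean =
      left >= step && freeCores(pos) - assigned(pos) >= step
    var candidates = freeCores.indices.filter(canTake)
    while (candidates.nonEmpty) {
      candidates.foreach { pos =>
        var keep = true
        while (keep && canTake(pos)) {
          left -= step
          assigned(pos) += step
          if (spreadOut) keep = false // move on to the next worker
        }
      }
      candidates = candidates.filter(canTake)
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    println(assign(Array(8, 8, 8), 12, 2, spreadOut = true).mkString(","))  // 4,4,4
    println(assign(Array(8, 8, 8), 12, 2, spreadOut = false).mkString(",")) // 8,4,0
  }
}
With spreadOut the 12 cores land as 4,4,4 across the three workers; without it the first worker is packed full (8,4,0) before the next one is touched.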
allocateWorkerResourceToExecutors()
Walk all usableWorkers and call allocateWorkerResourceToExecutors() on each, allocating the cores and actually scheduling the current app:
for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
  allocateWorkerResourceToExecutors(
    app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
}
For one scheduling pass of an application:
- If app.desc specifies the cores per executor (coresPerExecutor), the number of executors is the computed assignedCores divided by coresPerExecutor; otherwise a single executor is launched, taking all the assignedCores on that usable worker.
- Loop over the executor count: each iteration records the worker and core count for the app, launches one executor, and sets the app's state to RUNNING.
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
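A tiny worked example of that split (hypothetical numbers; split mirrors the two val lines above):
object AllocDemo {
  // numExecutors / coresToAssign, computed exactly as in the method above
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    (numExecutors, coresToAssign)
  }
  def main(args: Array[String]): Unit = {
    println(split(8, Some(2))) // (4,2): four executors, 2 cores each
    println(split(8, None))    // (1,8): one executor grabbing all 8 cores
  }
}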
So scheduling an app really means assigning resources to and launching all of its executors, spread evenly over the available workers.
launchExecutor() adds the ExecutorDesc to the worker,
sends the worker a LaunchExecutor message,
and finally sends an ExecutorAdded message to the app's driver.
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}