Spark Source Code (4): Master Resource Scheduling

山高水长~

Article link: https://blog.csdn.net/weixin_44024821/article/details/95641306

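Once the master has handled the registration requests from workers, applications, and drivers, it starts scheduling resources among these components so that Spark jobs can run.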

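Let's go over the process once more. After the cluster starts, each Worker registers with the Master, carrying its identity and resource information (ID, host, port, number of CPU cores, memory size). Once these resources are under the Master's management, the Master hands them out to Drivers and Applications according to its scheduling policy.
After the Master has allocated resources to a Driver, it sends the Worker a command to launch the Driver, and the Worker launches the Driver when it receives the command.
After the Master has allocated resources to an Application, it sends the Worker a command to launch an Executor, and the Worker launches the Executor when it receives the command.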
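
To make the bookkeeping concrete, here is a minimal sketch of the kind of per-worker record the master needs in order to answer "does this worker still have room?". `WorkerModelSketch`, `SimpleWorker`, and `reserve` are hypothetical names invented for illustration, not Spark classes; the only idea borrowed from the source is that free resources are the registered totals minus what has already been handed out.

```scala
object WorkerModelSketch {
  // Hypothetical, simplified stand-in for the master's per-worker record.
  final case class SimpleWorker(
      id: String,
      host: String,
      port: Int,
      cores: Int,             // total cores reported at registration
      memoryMb: Int,          // total memory (MB) reported at registration
      coresUsed: Int = 0,     // cores already handed to drivers/executors
      memoryUsedMb: Int = 0) { // memory already handed to drivers/executors

    def coresFree: Int = cores - coresUsed
    def memoryFreeMb: Int = memoryMb - memoryUsedMb

    // Returns a copy with the requested resources reserved,
    // or None if the request does not fit on this worker.
    def reserve(c: Int, memMb: Int): Option[SimpleWorker] =
      if (c <= coresFree && memMb <= memoryFreeMb)
        Some(copy(coresUsed = coresUsed + c, memoryUsedMb = memoryUsedMb + memMb))
      else None
  }
}
```

For example, a worker registered with 8 cores and 16384 MB that then hosts a driver needing 1 core and 1024 MB is left with 7 free cores and 15360 MB of free memory.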

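Throughout this process, the master's responsibilities include:
1. Managing Workers
2. Managing Applications
3. Managing Drivers
4. Receiving registrations, state updates, and heartbeats from every Worker
5. Registering Drivers and Applications
6. Centrally managing and allocating the cluster's resources (such as memory and CPU)
On the subject of managing and allocating cluster resources: besides the usual memory and CPU, I would argue that workers and executors also count as resources in Spark. The doc comment on schedule(), the master's core scheduling method, says it is called whenever resource availability changes, and a change in workers or executors certainly causes schedule() to run.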

Here is a summary of the events in the master that trigger schedule():
completeRecovery -- master active/standby failover
RegisterWorker
RegisterApplication
removeApplication
RequestSubmitDriver
relaunchDriver
removeDriver
handleRequestExecutors
ExecutorStateChanged
handleKillExecutors
Note that removeWorker does not trigger schedule() directly, because the core of removeWorker() is removing the drivers and executors that were running on that worker.

Now let's take a proper look at the source of schedule(), the core method of master resource scheduling.

/**
   * Schedule the currently available resources among waiting apps. This method will be called every time a new app joins or resource availability changes.
   */
  private def schedule(): Unit = {
    if (state != RecoveryState.ALIVE) {
      return
    }
    // Drivers take strict precedence over executors
    // Shuffle the ALIVE workers for load balancing and count them
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    val numWorkersAlive = shuffledAliveWorkers.size
    var curPos = 0
    // The end goal of scheduling is to run an application, and each application has exactly
    // one driver, so waiting drivers are scheduled first
    for (driver <- waitingDrivers.toList) {
      // Assign drivers to workers round-robin: for each driver, start from the worker after
      // the last one that was assigned a driver and continue until all alive workers
      // have been visited
      var launched = false
      var numWorkersVisited = 0
      // Keep looping while there are unvisited alive workers and the driver is not yet launched
      while (numWorkersVisited < numWorkersAlive && !launched) {
        val worker = shuffledAliveWorkers(curPos)
        numWorkersVisited += 1
        // The worker qualifies if its free memory and free cores are at least what the driver needs
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          // Launch the driver on this worker and remove it from the waiting queue
          launchDriver(worker, driver)
          waitingDrivers -= driver
          launched = true  // mark the driver as launched, which ends the while loop
        }
        // Move the pointer to the next worker
        curPos = (curPos + 1) % numWorkersAlive
      }
    }
    // With drivers placed on workers, launch the executors that will run the applications
    startExecutorsOnWorkers()
  }

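Summary of how the master schedules drivers: it takes each driver in waitingDrivers and walks the workers round-robin until it finds one whose free memory and free cores are at least what the driver needs, then launches the driver on that worker. This piece of logic is nicely written.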
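
The round-robin walk is easy to try in isolation. Below is a minimal, self-contained sketch of it, assuming simplified `Worker` and `Driver` case classes that I made up for illustration (`DriverPlacementSketch` and `place` are hypothetical names too). Unlike the real method it does not shuffle the workers, so the result is deterministic, and it reserves resources by replacing the worker record instead of sending a launch message.

```scala
object DriverPlacementSketch {
  // Hypothetical, simplified records; not Spark's WorkerInfo / DriverInfo.
  final case class Worker(id: String, coresFree: Int, memoryFree: Int)
  final case class Driver(id: String, cores: Int, mem: Int)

  // Returns driverId -> workerId for every driver that could be placed.
  def place(workers: Vector[Worker], drivers: List[Driver]): Map[String, String] = {
    val free = workers.toArray // mutable copy so reservations are visible to later drivers
    var curPos = 0             // shared across drivers, exactly like curPos in schedule()
    var placed = Map.empty[String, String]
    for (driver <- drivers) {
      var launched = false
      var visited = 0
      // Visit each worker at most once per driver, starting where the last driver left off
      while (visited < free.length && !launched) {
        val w = free(curPos)
        visited += 1
        if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
          // Reserve the driver's resources on this worker
          free(curPos) = w.copy(
            coresFree = w.coresFree - driver.cores,
            memoryFree = w.memoryFree - driver.mem)
          placed += driver.id -> w.id
          launched = true
        }
        curPos = (curPos + 1) % free.length
      }
    }
    placed
  }
}
```

With workers w1 (2 cores, 1024 MB) and w2 (4 cores, 4096 MB) and drivers d1 (1 core, 512 MB), d2 (1 core, 2048 MB), d3 (4 cores, 512 MB), d1 lands on w1, d2 on w2, and d3 stays unplaced because no worker has 4 free cores left after the first two placements.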

Next, let's see how the master launches executors on workers for the waiting applications.

/**
   * Schedule and launch executors on workers
   */
  private def startExecutorsOnWorkers(): Unit = {
    // Executors are launched in FIFO order: the first app in the queue is served first, then the second
    for (app <- waitingApps) {
      val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
      // If the cores left is less than the coresPerExecutor, the cores left will not be allocated
      if (app.coresLeft >= coresPerExecutor) {
        // Keep only ALIVE workers with enough free memory and cores for one executor of this app,
        // sorted by free cores in descending order
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
            worker.coresFree >= coresPerExecutor)
          .sortBy(_.coresFree).reverse
        // (Important) Decide how many cores this app gets on each worker; spreadOutApps selects the strategy
        val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
        // assignedCores(pos) now holds the number of cores granted on usableWorkers(pos)
        for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
          // Turn the cores granted on this worker into executors
          allocateWorkerResourceToExecutors(
            app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
        }
      }
    }
  }

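Summary of launching executors on workers: how many executors get launched depends on two factors. 1. How many workers qualify to launch executors for this application, that is, how many workers have at least the resources one executor of this app needs. 2. Very importantly, the Spark cluster first allocates cores based on the app and its usable workers, and only then decides how many executors to launch from the cores granted and the number of cores per executor the app specifies.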
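
The first factor, which workers qualify, can be sketched on its own. The example below is a minimal illustration under assumed names (`UsableWorkersSketch`, `Worker`, and `usableWorkers` are hypothetical, and `alive` stands in for the `WorkerState.ALIVE` check); it mirrors the filter-then-sort chain from startExecutorsOnWorkers().

```scala
object UsableWorkersSketch {
  // Hypothetical, simplified worker record; `alive` replaces the WorkerState check.
  final case class Worker(id: String, alive: Boolean, coresFree: Int, memoryFree: Int)

  // Workers that can host at least one executor, most free cores first.
  def usableWorkers(
      workers: Seq[Worker],
      memoryPerExecutorMB: Int,
      coresPerExecutor: Int): Seq[Worker] =
    workers
      .filter(_.alive)
      .filter(w => w.memoryFree >= memoryPerExecutorMB && w.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree)
      .reverse
}
```

Given an executor size of 2 cores and 1024 MB, a live worker with 8 cores but only 512 MB free is dropped, a dead worker is dropped regardless of its resources, and the remaining workers come back ordered from most to fewest free cores.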

Now for the core allocation algorithm, scheduleExecutorsOnWorkers:

  /**
   * Schedule executors to be launched on the workers. Returns an array containing number of cores assigned to each worker.
   *
   * There are two modes of launching executors. The first attempts to spread out an application's
   * executors on as many workers as possible, while the second does the opposite (i.e. launch them
   * on as few workers as possible). The former is usually better for data locality purposes and is
   * the default.
   *
   * The number of cores assigned to each executor is configurable. When this is explicitly set,
   * multiple executors from the same application may be launched on the same worker if the worker
   * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
   * worker by default, in which case only one executor per application may be launched on each
   * worker during one single schedule iteration.
   */
  private def scheduleExecutorsOnWorkers(
      app: ApplicationInfo,
      usableWorkers: Array[WorkerInfo],
      spreadOutApps: Boolean): Array[Int] = {
    val coresPerExecutor = app.desc.coresPerExecutor
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1) // minimum cores per executor: the configured value, or 1 if unset
    // If cores per executor is not set, one executor grabs as many free cores on the worker as it can,
    // so the worker hosts only one executor for this app
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    val memoryPerExecutor = app.desc.memoryPerExecutorMB // memory each executor of this application needs
    val numUsable = usableWorkers.length // number of usable workers
    val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker (one slot per worker, initially all zeros)
    val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker (one slot per worker, initially all zeros)
    // Cores to hand out: the smaller of the cores the app still needs and the total free cores on the usable workers
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

    // Returns whether the worker at this position can launch an executor for this application
    def canLaunchExecutor(pos: Int): Boolean = {
      // Keep scheduling while the cores still to assign cover at least one executor
      val keepScheduling = coresToAssign >= minCoresPerExecutor
      // The worker has enough cores if its free cores minus those already assigned cover one more executor
      val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor
      // If multiple executors per worker are allowed we can always launch a new one;
      // otherwise, once this worker has an executor, we can only add cores to it
      val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      if (launchingNewExecutor) {
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor  // memory already committed on this worker in this round
        // The worker has enough memory if its free memory minus the committed memory covers one more executor
        val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
        val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
        keepScheduling && enoughCores && enoughMemory && underLimit
      } else {
        // We're adding cores to an existing executor, so no need to check memory and executor limits
        keepScheduling && enoughCores
      }
    }

    var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
    // Keep launching executors while any worker can still take one. Note the nested loops.
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunchExecutor(pos)) {
          // Deduct one executor's minimum cores from the pool and credit them to this worker
          coresToAssign -= minCoresPerExecutor
          assignedCores(pos) += minCoresPerExecutor

          // If we are launching one executor per worker, then every iteration assigns 1 core
          // to the executor. Otherwise, every iteration assigns cores to a new executor.
          if (oneExecutorPerWorker) {
            assignedExecutors(pos) = 1
          } else {
            assignedExecutors(pos) += 1
          }

          // With spreadOutApps, executors are spread across as many workers as possible;
          // without it, we keep scheduling on this worker until its resources are used up.
          // spreadOutApps: after one assignment on this worker, leave the inner loop and move to the next worker
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      freeWorkers = freeWorkers.filter(canLaunchExecutor)  // workers that can still take an executor; when none remain, the outer while loop ends
    }
    assignedCores  // the number of cores assigned on each worker
  }

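Summary of core allocation on workers: scheduleExecutorsOnWorkers() has three parts. The first defines the basic variables, the second defines the check for whether a worker can launch an executor, and the third is the core algorithm that assigns cores to workers. With spreadOutApps enabled, the cores the app needs are distributed evenly across different workers.
It is worth going over the relationship between cores, executors, and the app once more. All executors launched in this pass serve this one app, and how many executors a worker launches depends on how many cores that worker was assigned for the app. Conversely, the spreadOutApps strategy gives a worker minCoresPerExecutor cores per visit, which is exactly what one executor needs, so executors end up on different workers and the app's tasks are spread evenly across them.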
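
A small worked example makes the difference between the two modes visible. The sketch below reimplements the assignment loop on plain arrays so it can run without a cluster. `CoreAssignmentSketch` and `assignCores` are hypothetical names, the per-worker free cores and memory are passed in directly, and the executor limit check (`underLimit`) is left out for brevity, so treat it as an approximation of the method above and not the real thing.

```scala
object CoreAssignmentSketch {
  // Returns the number of cores assigned on each worker (same order as the inputs).
  def assignCores(
      coresFree: Array[Int],          // free cores on each usable worker
      memoryFree: Array[Int],         // free memory (MB) on each usable worker
      coresLeft: Int,                 // cores the app still wants
      coresPerExecutor: Option[Int],
      memoryPerExecutor: Int,
      spreadOut: Boolean): Array[Int] = {
    val minCores = coresPerExecutor.getOrElse(1)
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    val n = coresFree.length
    val assignedCores = new Array[Int](n)
    val assignedExecutors = new Array[Int](n)
    var coresToAssign = math.min(coresLeft, coresFree.sum)

    def canLaunch(pos: Int): Boolean = {
      val keepScheduling = coresToAssign >= minCores
      val enoughCores = coresFree(pos) - assignedCores(pos) >= minCores
      val launchingNew = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      if (launchingNew) {
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        val enoughMemory = memoryFree(pos) - assignedMemory >= memoryPerExecutor
        keepScheduling && enoughCores && enoughMemory
      } else {
        keepScheduling && enoughCores // only growing an existing executor
      }
    }

    var free = (0 until n).filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keep = true
        while (keep && canLaunch(pos)) {
          coresToAssign -= minCores
          assignedCores(pos) += minCores
          if (oneExecutorPerWorker) assignedExecutors(pos) = 1
          else assignedExecutors(pos) += 1
          if (spreadOut) keep = false // one grant per visit, then move to the next worker
        }
      }
      free = free.filter(canLaunch)
    }
    assignedCores
  }
}
```

Take three workers with 4 free cores and 8192 MB each, and an app that still wants 6 cores with 2 cores and 1024 MB per executor. With spreadOut enabled the result is 2, 2, 2 (one executor per worker); with it disabled the result is 4, 2, 0 (the first worker is filled before the second is touched). If coresPerExecutor is unset and the app wants 5 cores, the spread-out result is 2, 2, 1: cores are handed out one at a time, and each worker ends up with a single executor that owns all of them.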

Now it is time to see how many executors each worker actually launches.

 /**
   * Allocate a worker's resources to one or more executors.
   * @param app the info of the application which the executors belong to
   * @param assignedCores number of cores on this worker for this application
   * @param coresPerExecutor number of cores per executor
   * @param worker the worker info
   */
  private def allocateWorkerResourceToExecutors(
      app: ApplicationInfo,
      assignedCores: Int,
      coresPerExecutor: Option[Int],
      worker: WorkerInfo): Unit = {
    // If the number of cores per executor is specified, we divide the cores assigned
    // to this worker evenly among the executors with no remainder.
    // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
    val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    for (i <- 1 to numExecutors) {
      // Record the executor in the application's in-memory bookkeeping
      val exec = app.addExecutor(worker, coresToAssign)
      launchExecutor(worker, exec)
      app.state = ApplicationState.RUNNING
    }
  }
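
The arithmetic here is small enough to check by hand. As a minimal sketch (`ExecutorSplitSketch` and `split` are hypothetical names, and the function returns the core count of each executor instead of launching anything), the two lines that compute numExecutors and coresToAssign behave like this:

```scala
object ExecutorSplitSketch {
  // Returns the core count of each executor to launch on one worker.
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): Seq[Int] = {
    // Same two expressions as in allocateWorkerResourceToExecutors
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    val coresEach = coresPerExecutor.getOrElse(assignedCores)
    Seq.fill(numExecutors)(coresEach)
  }
}
```

So 4 assigned cores with coresPerExecutor set to 2 become two executors of 2 cores each, while the same 4 cores with coresPerExecutor unset become a single executor holding all 4.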

As the last step, the master tells the worker to launch the executor.

  private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    // Record the executor in the worker's in-memory bookkeeping
    worker.addExecutor(exec)
    // Send the worker a message asking it to launch this executor
    worker.endpoint.send(LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
    // Notify the application's driver so it can record the new executor
    exec.application.driver.send(
      ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
  }

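That completes the walk through the master's schedule() method. In one sentence, it decides how many executors each worker should launch for each app.
A rough summary of the master's resource allocation steps:
1. Launch drivers on workers in round-robin fashion
2. Once the drivers have been dispatched, launch executors on workers
a. Use the spreadOut strategy to distribute executors evenly across workers, which yields how many cores each worker should allocate to the application
b. Launch the corresponding number of executors based on those core counts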