How Does the Spark Master Allocate Cluster Resources?

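This article walks through the Spark 1.6 source code to explain how the Spark Master allocates cluster resources. The Master re-allocates cluster resources every time it receives a register-worker message from a Worker, a register-driver request from a client, a register-application request, or an executor launch request, and also whenever a driver or executor finishes and releases its resources. So how does the Master actually perform the allocation? Let's peel back the layers one at a time. The entry point of the whole scheduling process is the schedule() method of the Master class: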

/**
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
private def schedule(): Unit = {
  // Only the ALIVE (active) master schedules; any other master does nothing.
  if (state != RecoveryState.ALIVE) { return }
  // Drivers take strict precedence over executors, so driver resources are allocated first.
  // Randomization helps balance drivers: the workers are shuffled to spread the load.
  val shuffledWorkers = Random.shuffle(workers)
  // Walk the shuffled workers; on each one, try the waiting drivers in registration order.
  for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) { // ALIVE workers only
    for (driver <- waitingDrivers) { // drivers still waiting for resources, in FIFO order
      // If the worker's free memory and free cores both cover what the driver asks for,
      // give those resources to the driver.
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver) // start the driver on this worker
        waitingDrivers -= driver // remove it from the list of drivers waiting for resources
      }
    }
  }
  startExecutorsOnWorkers() // then allocate executors on the workers
}

The code that allocates resources to a driver is as follows:

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // The Master sends a message through its EndpointRef for that Worker; the Worker starts the driver.
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING // mark the driver that just received resources as running
}

 

// WorkerInfo class
def addDriver(driver: DriverInfo) {
  drivers(driver.id) = driver // add to the set of drivers running on this worker
  memoryUsed += driver.desc.mem // update the memory and cores currently in use
  coresUsed += driver.desc.cores
}

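That completes this round of driver allocation. In short, the Master gives drivers priority. Every call to schedule() walks the ALIVE workers in shuffled order and, on each worker, tries every waiting driver in FIFO order: if the worker's free memory and cores cover what the driver needs, the resources are assigned, the driver is launched on that worker, and it is removed from the waiting list. A driver that no worker can accommodate stays in the queue and waits for the next call to schedule().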
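The driver placement loop can be tried out in isolation with a small model. The following is only a sketch: Worker, Driver and placeDrivers are hypothetical, simplified stand-ins for WorkerInfo, DriverInfo and the loop in schedule(), not Spark classes, and the inner loop iterates over a snapshot of the waiting list so that removing a placed driver is safe.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object DriverPlacementSketch {
  // Hypothetical, simplified stand-ins for Spark's WorkerInfo and DriverInfo.
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  /** Returns driverId -> workerId for every driver that could be placed. */
  def placeDrivers(
      workers: Seq[Worker],
      waiting: ArrayBuffer[Driver],
      rng: Random): Map[String, String] = {
    var placed = Map.empty[String, String]
    // Shuffle the workers, then try the waiting drivers on each worker in FIFO order.
    for (worker <- rng.shuffle(workers)) {
      // Iterate over a snapshot so that removing from `waiting` is safe.
      for (driver <- waiting.toList) {
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          worker.memoryFree -= driver.mem   // plays the role of WorkerInfo.addDriver
          worker.coresFree -= driver.cores
          placed = placed + (driver.id -> worker.id)
          waiting -= driver                 // no longer waiting for resources
        }
      }
    }
    placed
  }
}
```

With a worker w1 (4096 MB, 4 cores), a worker w2 (1024 MB, 1 core) and three waiting drivers that each need 2048 MB and 2 cores, the first two drivers land on w1 whatever the shuffle order is, and the third stays in the waiting list until a later round.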

Next comes the allocation of resources to executors:

/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  // Apps are served in registration order: whichever registered first gets as much as possible.
  // coresLeft is the number of cores the app still needs; the Master may not grant them all at once.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor // cores per executor, if set
    // Filter out workers that don't have enough resources to launch an executor
    // Keep the ALIVE workers that can host at least one executor, sorted by free cores, descending.
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide how many cores each usable worker contributes in this round (possibly none).
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    // For each app, everything decided in this round is handed out in one go.
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      // Every worker that actually received cores turns them into executors.
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos)) // the actual allocation
    }
  }
}
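The worker filter at the top of that loop is easy to misread, so here is a minimal sketch of it on its own. Worker and usableWorkers below are hypothetical stand-ins (a Boolean alive flag replaces WorkerState.ALIVE); only the filter conditions and the descending sort mirror the code above.

```scala
object UsableWorkersSketch {
  // Hypothetical stand-in for WorkerInfo; `alive` replaces the WorkerState.ALIVE check.
  final case class Worker(id: String, alive: Boolean, memoryFree: Int, coresFree: Int)

  /** Workers able to host at least one executor, most free cores first. */
  def usableWorkers(
      workers: Seq[Worker],
      memoryPerExecutorMB: Int,
      coresPerExecutor: Option[Int]): Seq[Worker] =
    workers
      .filter(_.alive)
      .filter(w => w.memoryFree >= memoryPerExecutorMB &&
        w.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree)
      .reverse // descending by free cores
}
```

Given w1 (alive, 4096 MB, 2 cores), w2 (alive, 512 MB, 8 cores), w3 (not alive, 8192 MB, 16 cores) and w4 (alive, 8192 MB, 6 cores), an app asking for 1024 MB and 2 cores per executor gets the list w4, w1: w2 lacks memory, w3 is not alive, and the remaining workers are ordered by free cores from largest to smallest.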

Let's first look at the code that actually hands the resources to executors:

/**
 * Allocate a worker's resources to one or more executors.
 * @param app the info of the application which the executors belong to
 * @param assignedCores number of cores on this worker for this application
 * @param coresPerExecutor number of cores per executor
 * @param worker the worker info
 */
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // Number of executors to launch on this worker: assigned cores / cores per executor.
  // If cores per executor is not set, launch a single executor on this worker that takes
  // all the cores assigned to the worker in this round.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  // Cores given to each executor; if not set, all the assigned cores.
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING // the app is RUNNING as soon as it has one executor
  }
}
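The two values computed at the top of that method decide how a worker's share is cut into executors. A minimal sketch of just that arithmetic follows; split is a hypothetical helper introduced here for illustration, not part of Spark.

```scala
object ExecutorSplitSketch {
  /** Returns (number of executors, cores per executor) for one worker's share. */
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
    // Integer division: how many fixed-size executors fit into the assigned cores.
    // With no executor size configured, a single executor is launched.
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    // That single executor takes every assigned core.
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    (numExecutors, coresToAssign)
  }
}
```

For example, split(12, Some(4)) gives (3, 4), i.e. three executors with 4 cores each, while split(12, None) gives (1, 12), a single executor holding all 12 cores.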

 

// ApplicationInfo class
private[master] def addExecutor(
    worker: WorkerInfo,
    cores: Int,
    useID: Option[Int] = None): ExecutorDesc = {
  val exec = new ExecutorDesc(newExecutorId(useID), this, worker, cores, desc.memoryPerExecutorMB)
  executors(exec.id) = exec // add to this app's executor collection
  coresGranted += cores
  exec
}

 

// Master class
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  // Tell the Worker to commit the resources and start the executor.
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Tell the driver so that it can update its executor metadata.
  exec.application.driver.send(ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

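Now for the core of the allocation logic: how the Master weighs the free resources of the whole cluster when assigning resources to executors. The code is as follows: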

/**
 * Schedule executors to be launched on the workers.
 * Returns an array containing number of cores assigned to each worker.
 *
 * There are two modes of launching executors. The first attempts to spread out an application's
 * executors on as many workers as possible, while the second does the opposite (i.e. launch them
 * on as few workers as possible). The former is usually better for data locality purposes and is
 * the default.
 *
 * The number of cores assigned to each executor is configurable. When this is explicitly set,
 * multiple executors from the same application may be launched on the same worker if the worker
 * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
 * worker by default, in which case only one executor may be launched on each worker.
 *
 * It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
 * at a time). Consider the following example: cluster has 4 workers with 16 cores each.
 * User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
 * allocated at a time, 12 cores from each worker would be assigned to each executor.
 * Since 12 < 16, no executors would launch [SPARK-8881].
 */
private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
  val coresPerExecutor = app.desc.coresPerExecutor // cores per executor, if configured
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1) // not configured: at least 1 core
  val oneExecutorPerWorker = coresPerExecutor.isEmpty // at most one executor per worker?
  val memoryPerExecutor = app.desc.memoryPerExecutorMB // memory per executor, in MB
  val numUsable = usableWorkers.length // usable workers, sorted by free cores, descending
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  // The smaller of: the cores the app still needs, and the total free cores on the usable workers.
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)


  
  /** Return whether the specified worker can launch an executor for this app. */
  def canLaunchExecutor(pos: Int): Boolean = {
    // The app still needs at least one executor's worth of cores.
    val keepScheduling = coresToAssign >= minCoresPerExecutor
    // The worker's free cores, minus those already assigned in this round, cover one more executor.
    val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

    // If we allow multiple executors per worker, then we can always launch new executors.
    // Otherwise, if there is already an executor on this worker, just give it more cores.
    // In other words, a new executor is launched either when the executor core count is
    // configured (several executors per worker are allowed), or when this worker has not yet
    // been given an executor for this app in the current schedule() call.
    val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
    if (launchingNewExecutor) { // memory and cores must both be sufficient
      val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
      // Free memory minus the memory already assigned in this round covers one more executor.
      val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
      // Executors about to be launched plus existing ones stay below the app's executor limit.
      val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
      // All four conditions hold, so this worker can give the app one more executor.
      keepScheduling && enoughCores && enoughMemory && underLimit
    } else {
      // We're adding cores to an existing executor, so no need to check memory and executor limits
      keepScheduling && enoughCores
    }
  }

  
  // Keep launching executors until no more workers can accommodate any
  // more executors, or if we have reached this application's limits
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor) // workers that can still take one
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunchExecutor(pos)) {
        coresToAssign -= minCoresPerExecutor // cores still to assign, minus this grant
        assignedCores(pos) += minCoresPerExecutor // cores assigned on this worker in this round
        if (oneExecutorPerWorker) {
          assignedExecutors(pos) = 1 // one executor per worker: the same executor gains a core
        } else {
          assignedExecutors(pos) += 1 // otherwise the cores go to a new executor
        }
        // Spreading out an application means spreading out its executors across as
        // many workers as possible. If we are not spreading out, then we should keep
        // scheduling executors on this worker until we use all of its resources.
        // Otherwise, just move on to the next worker.
        if (spreadOutApps) { // spread the app's executors over as many workers as possible
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor) // candidates for the next pass
  }
  assignedCores
}

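To summarize: on every schedule() call the Master serves all registered apps in FIFO order. For each app it loops over the workers whose free resources can host at least one executor, and it hands out resources in one of three ways:

1. SpreadOut mode (default)

Each pass gives every eligible worker only one executor. Passes repeat until no visited worker can supply the resources for another executor, or the app needs no more cores.

2. Non-SpreadOut mode

Each pass packs as many executors as possible onto each eligible worker, using up its assignable resources before moving on to the next worker.

3. Worker-Per-Executor mode (one executor per worker)

If the app does not specify the number of cores per executor, each worker hosts at most one executor for it. The loop still visits the eligible workers in turn: a worker that has no executor for the app yet is given one, and later visits only add cores, one per iteration, to that same executor.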
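The three modes can be compared with a small simulation. This is a simplified sketch that works on plain arrays instead of WorkerInfo and ApplicationInfo: assignCores and its parameters are hypothetical names, and it assumes the app has no executors yet, so the executor limit is compared only against executors assigned in this call.

```scala
object CoreAssignmentSketch {
  /** Simplified core-assignment loop; index i of each array describes usable worker i. */
  def assignCores(
      coresFree: Array[Int],
      memoryFree: Array[Int],
      coresLeft: Int,
      coresPerExecutor: Option[Int],
      memoryPerExecutor: Int,
      spreadOut: Boolean,
      executorLimit: Int = Int.MaxValue): Array[Int] = {
    val minCores = coresPerExecutor.getOrElse(1)
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    val n = coresFree.length
    val assignedCores = new Array[Int](n)
    val assignedExecutors = new Array[Int](n)
    var coresToAssign = math.min(coresLeft, coresFree.sum)

    // Same four checks as canLaunchExecutor, minus the app's existing executors.
    def canLaunch(pos: Int): Boolean = {
      val keepScheduling = coresToAssign >= minCores
      val enoughCores = coresFree(pos) - assignedCores(pos) >= minCores
      val newExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      if (newExecutor) {
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        val enoughMemory = memoryFree(pos) - assignedMemory >= memoryPerExecutor
        val underLimit = assignedExecutors.sum < executorLimit
        keepScheduling && enoughCores && enoughMemory && underLimit
      } else {
        keepScheduling && enoughCores
      }
    }

    var free = (0 until n).filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keep = true
        while (keep && canLaunch(pos)) {
          coresToAssign -= minCores
          assignedCores(pos) += minCores
          if (oneExecutorPerWorker) assignedExecutors(pos) = 1
          else assignedExecutors(pos) += 1
          if (spreadOut) keep = false // move on to the next worker after one grant
        }
      }
      free = free.filter(canLaunch)
    }
    assignedCores
  }
}
```

With three workers of 8 free cores each, ample memory, an app that still needs 12 cores and 4 cores per executor, SpreadOut yields 4, 4, 4 and non-SpreadOut yields 8, 4, 0. For the SPARK-8881 example from the doc comment (four 16-core workers, 48 cores wanted, 16 cores per executor) the result is 16, 16, 16, 0, so three executors can launch. With no executor size set, two 4-core workers and 6 cores wanted, SpreadOut alternates one core at a time and ends at 3, 3.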

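Because the Master hands out cluster resources in FIFO order, a large job can hold on to resources for a long time while small jobs wait because they cannot obtain resources in time. The scheduling is therefore fairly coarse, and it could borrow ideas from operating-system resource scheduling.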
