yarn3.2 源码分析之FairScheduler连续调度和assignContainer流程

概述

FairScheduler分配container的核心调度流程

核心调度流程如下:

  1. 调度器锁住FairScheduler对象,避免核心数据结构冲突。
  2. 调度器选取集群的一个节点(node),从树形队列的根节点ROOT开始出发,每层队列都会按照公平策略选择一个子队列,最后在叶子队列按照公平策略选择一个App,为这个App在node上找一块适配的资源。

对于每层队列进行如下流程:

  1. 队列预先检查:检查队列的资源使用量是否已经超过了队列的Quota
  2. 排序子队列/App:按照公平调度策略,对子队列/App进行排序
  3. 递归调度子队列/App

例如,某次调度的路径是ROOT -> ParentQueueA -> LeafQueueA1 -> App11,这次调度会从node上给App11分配Container。

FairScheduler架构

公平调度器是一个多线程异步协作的架构,而为了保证调度过程中数据的一致性,在主要的流程中加入了FairScheduler对象锁。其中核心调度流程是单线程执行的。这意味着Container分配是串行的,这是调度器存在性能瓶颈的核心原因。

  • scheduler Lock:FairScheduler对象锁
  • AllocationFileLoaderService:负责公平策略配置文件的热加载,更新队列数据结构
  • Continuous Scheduling Thread:开启连续调度时的核心调度线程,不停地执行分配container的核心调度流程。
  • Update Thread:更新队列资源需求,执行Container抢占流程等
  • Scheduler Event Dispatcher Thread: 调度器事件的处理器,处理App新增,App结束,node新增,node移除等事件

FairScheduler的资源调度方式

 FairScheduler支持2种资源调度方式:心跳调度和连续调度。

心跳调度方式:NodeManager向ResourceManager汇报了自身资源情况(比如,当前可用资源,正在使用的资源,已经释放的资源),这个RPC会触发ResourceManager调用nodeUpdate()方法,这个方法为这个节点进行一次资源调度,即,从维护的Queue中取出合适的应用的资源请求(合适 ,指的是这个资源请求既不违背队列的最大资源使用限制,也不违背这个NodeManager的剩余资源量限制)放到这个NodeManager上运行。这种调度方式一个主要缺点就是调度缓慢,当一个NodeManager即使已经有了剩余资源,调度也只能在心跳发送以后才会进行,不够及时。

连续调度方式:由一个独立的线程ContinuousSchedulingThread负责进行持续的资源调度,与NodeManager的心跳是异步进行的。即不需要等到NodeManager发来心跳才开始资源调度。 

FairSharePolicy的比较器

FariSchaeduler根据FairSharePolicy的比较器,对队列/app进行排序。

两个组, 排序的规则是:

1. 一个需要资源, 另外一个不需要资源, 则需要资源的排前面

2. 若都需要资源的话, 对比 使用的内存占minShare的比例, 比例小的排前面, (即尽量保证达到minShare)

3. 若比例相同的话, 计算出使用量与权重的比例, 小的排前面, 即权重大的优先, 使用量小的优先.

4. 若还是相同, 提交时间早的优先, app id小的排前面.

/**
   * Compare Schedulables mainly via fair share usage to meet fairness.
   * Specifically, it goes through following four steps.
   *
   * 1. Compare demands. Schedulables without resource demand get lower priority
   * than ones who have demands.
   * 
   * 2. Compare min share usage. Schedulables below their min share are compared
   * by how far below it they are as a ratio. For example, if job A has 8 out
   * of a min share of 10 tasks and job B has 50 out of a min share of 100,
   * then job B is scheduled next, because B is at 50% of its min share and A
   * is at 80% of its min share.
   * 
   * 3. Compare fair share usage. Schedulables above their min share are
   * compared by fair share usage by checking (resource usage / weight).
   * If all weights are equal, slots are given to the job with the fewest tasks;
   * otherwise, jobs with more weight get proportionally more slots. If weight
   * equals to 0, we can't compare Schedulables by (resource usage/weight).
   * There are two situations: 1)All weights equal to 0, slots are given
   * to one with less resource usage. 2)Only one of weight equals to 0, slots
   * are given to the one with non-zero weight.
   *
   * 4. Break the tie by compare submit time and job name.
   */
  private static class FairShareComparator implements Comparator<Schedulable>,
      Serializable {
    private static final long serialVersionUID = 5564969375856699313L;

    @Override
    public int compare(Schedulable s1, Schedulable s2) {
      int res = compareDemand(s1, s2);

      // Share resource usages to avoid duplicate calculation
      Resource resourceUsage1 = null;
      Resource resourceUsage2 = null;

      if (res == 0) {
        resourceUsage1 = s1.getResourceUsage();
        resourceUsage2 = s2.getResourceUsage();
        res = compareMinShareUsage(s1, s2, resourceUsage1, resourceUsage2);
      }

      if (res == 0) {
        res = compareFairShareUsage(s1, s2, resourceUsage1, resourceUsage2);
      }

      // Break the tie by submit time
      if (res == 0) {
        res = (int) Math.signum(s1.getStartTime() - s2.getStartTime());
      }

      // Break the tie by job name
      if (res == 0) {
        res = s1.getName().compareTo(s2.getName());
      }

      return res;
    }
  }

 心跳调度源码分析

 略,请参阅博客:yarn3.2源码分析之NM与RM通信完成心跳调度

 连续调度源码分析 

  ContinuousSchedulingThread

 用于连续调度的线程。连续调度默认不开启,只有设置yarn.scheduler.fair.continuous-scheduling-enabled参数为true,才会启动该线程。连续调度现在已经不推荐了,因为它会因为锁的问题,而导致资源调度变得缓慢。可以使用yarn.scheduler.assignmultiple参数启动批量分配功能,作为连续调度的替代品。

 /**
   * Thread which attempts scheduling resources continuously,
   * asynchronous to the node heartbeats.
   */
  @Deprecated
  private class ContinuousSchedulingThread extends Thread {

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          continuousSchedulingAttempt();
          Thread.sleep(getContinuousSchedulingSleepMs());
        } catch (InterruptedException e) {
          LOG.warn("Continuous scheduling thread interrupted. Exiting.", e);
          return;
        }
      }
    }
  }

 continuousSchedulingAttempt()方法


                
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值