YARN 3.2 Source Code Analysis: FairScheduler Continuous Scheduling and the assignContainer Flow

zhifeng687

Last modified: 2022-03-06 19:20:03

Category: Yarn. Tags: zookeeper, java, distributed systems

First published: 2019-04-15 11:42:46

Article link: https://blog.csdn.net/qq_26222859/article/details/89308976

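Overview

Core scheduling flow for FairScheduler container allocation

The core scheduling flow is as follows:

The scheduler locks the FairScheduler object to avoid conflicts on the core data structures.
The scheduler picks one node in the cluster and starts from ROOT, the root of the queue tree. At each level it selects one child queue according to the fair policy. In the leaf queue it selects one App, again according to the fair policy, and finds a suitable chunk of resources for that App on the node.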

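Each queue level goes through the following steps:

Queue pre-check: check whether the queue's resource usage already exceeds the queue's quota.
Sort child queues/Apps: sort the child queues/Apps according to the fair scheduling policy.
Recursively schedule the child queues/Apps.

For example, if the path of one scheduling pass is ROOT -> ParentQueueA -> LeafQueueA1 -> App11, that pass allocates a Container for App11 from the node.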
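The descent described above can be sketched in a few lines of Java. This is a minimal illustration, not the Hadoop implementation: `SketchNode`, `QueueDescentSketch`, the integer resource units, and the "least used first" ordering are all hypothetical stand-ins (the real ordering comes from the FairSharePolicy comparator).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a queue or an app in the scheduling tree. */
class SketchNode {
    final String name;
    final boolean app;   // true for an app, false for a queue
    final int used;      // resources in use, in abstract units
    final int quota;     // upper bound on usage (queues only)
    final int demand;    // size of the pending request (apps only)
    final List<SketchNode> children = new ArrayList<>();

    private SketchNode(String name, boolean app, int used, int quota, int demand) {
        this.name = name;
        this.app = app;
        this.used = used;
        this.quota = quota;
        this.demand = demand;
    }

    static SketchNode queue(String name, int used, int quota, SketchNode... children) {
        SketchNode q = new SketchNode(name, false, used, quota, 0);
        for (SketchNode c : children) {
            q.children.add(c);
        }
        return q;
    }

    static SketchNode app(String name, int used, int demand) {
        return new SketchNode(name, true, used, Integer.MAX_VALUE, demand);
    }
}

class QueueDescentSketch {
    // Placeholder ordering (assumption): least-used first.
    static final Comparator<SketchNode> FAIR_ORDER =
            Comparator.comparingInt((SketchNode n) -> n.used);

    /** Returns the path to the app that receives a container, or null if nothing fits. */
    static String assignContainer(SketchNode node, int nodeFree) {
        if (node.app) {
            // Leaf of the recursion: does the app's request fit on this node?
            return (node.demand > 0 && node.demand <= nodeFree) ? node.name : null;
        }
        // Step 1: queue pre-check against the quota.
        if (node.used >= node.quota) {
            return null;
        }
        // Step 2: sort child queues/apps by the fairness order.
        List<SketchNode> sorted = new ArrayList<>(node.children);
        sorted.sort(FAIR_ORDER);
        // Step 3: recurse; the first child that can allocate wins.
        for (SketchNode child : sorted) {
            String path = assignContainer(child, nodeFree);
            if (path != null) {
                return node.name + " -> " + path;
            }
        }
        return null;
    }

    /** Example tree with made-up usage numbers. */
    static SketchNode exampleTree() {
        return SketchNode.queue("ROOT", 30, 100,
                SketchNode.queue("ParentQueueA", 10, 50,
                        SketchNode.queue("LeafQueueA1", 4, 20,
                                SketchNode.app("App11", 1, 2),
                                SketchNode.app("App12", 3, 2)),
                        SketchNode.queue("LeafQueueA2", 6, 20,
                                SketchNode.app("App21", 6, 2))),
                SketchNode.queue("ParentQueueB", 20, 50));
    }

    public static void main(String[] args) {
        // A node with 4 free units; prints the path taken by this scheduling pass.
        System.out.println(assignContainer(exampleTree(), 4));
    }
}
```

With the made-up numbers in `exampleTree()` and 4 free units on the node, the pass follows ROOT -> ParentQueueA -> LeafQueueA1 -> App11, the same path as the example in the text.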

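FairScheduler architecture

The fair scheduler is built from several threads that cooperate asynchronously. To keep data consistent during scheduling, the main flows acquire the FairScheduler object lock, so the core scheduling flow effectively runs single-threaded. Container allocation is therefore serial, which is the main reason the scheduler becomes a performance bottleneck.

Scheduler lock: the FairScheduler object lock.
AllocationFileLoaderService: hot-reloads the fair policy configuration file and updates the queue data structures.
Continuous Scheduling Thread: the core scheduling thread when continuous scheduling is enabled; it repeatedly runs the core container allocation flow.
Update Thread: updates queue resource demand, runs the Container preemption flow, and so on.
Scheduler Event Dispatcher Thread: handles scheduler events such as App added, App finished, node added, and node removed.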

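FairScheduler resource scheduling modes

FairScheduler supports two resource scheduling modes: heartbeat scheduling and continuous scheduling.

Heartbeat scheduling: a NodeManager reports its resource status to the ResourceManager (for example, currently available resources, resources in use, and released resources). This RPC triggers the ResourceManager to call nodeUpdate(), which runs one round of scheduling for that node: it takes suitable resource requests from the queues it maintains and places them on this NodeManager. "Suitable" means the request violates neither the queue's maximum resource limit nor the NodeManager's remaining capacity. The main drawback of this mode is latency: even when a NodeManager already has free resources, scheduling happens only after its next heartbeat arrives.

Continuous scheduling: a dedicated thread, ContinuousSchedulingThread, schedules resources continuously and asynchronously to the NodeManager heartbeats, so scheduling does not have to wait for a heartbeat.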
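To make the difference between the two modes concrete, here is a minimal sketch. It assumes a toy model in which every container needs exactly one resource unit; `SchedulingModesSketch` and its fields are hypothetical, and only the method names `nodeUpdate` and `continuousSchedulingAttempt` are borrowed from the text. Locking is simplified to `synchronized` on one object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (assumption): each pending container needs one resource unit. */
class SchedulingModesSketch {
    private final Map<String, Integer> freeByNode = new LinkedHashMap<>();
    private int pendingContainers;
    private int allocated;

    SchedulingModesSketch(int pendingContainers) {
        this.pendingContainers = pendingContainers;
    }

    synchronized void addNode(String node, int free) {
        freeByNode.put(node, free);
    }

    /** Heartbeat mode: runs only when this node reports in, and only for this node. */
    synchronized void nodeUpdate(String node, int reportedFree) {
        freeByNode.put(node, reportedFree);
        attemptScheduling(node);
    }

    /** Continuous mode: one pass over every known node, without waiting for a heartbeat. */
    void continuousSchedulingAttempt() {
        List<String> nodes;
        synchronized (this) {
            nodes = new ArrayList<>(freeByNode.keySet());
        }
        for (String node : nodes) {
            attemptScheduling(node);
        }
    }

    /** Both modes end up in the same allocation step, serialized by the scheduler lock. */
    private synchronized void attemptScheduling(String node) {
        Integer free = freeByNode.get(node);
        if (free == null) {
            return; // the node is no longer tracked
        }
        while (free > 0 && pendingContainers > 0) {
            free--;
            pendingContainers--;
            allocated++;
        }
        freeByNode.put(node, free);
    }

    synchronized int allocated() {
        return allocated;
    }

    public static void main(String[] args) {
        SchedulingModesSketch scheduler = new SchedulingModesSketch(5);
        scheduler.addNode("n1", 2);
        scheduler.addNode("n2", 2);
        scheduler.nodeUpdate("n1", 2);            // only n1 is scheduled
        System.out.println(scheduler.allocated());
        scheduler.continuousSchedulingAttempt();  // n2 is scheduled without a heartbeat
        System.out.println(scheduler.allocated());
    }
}
```

In the sketch, free capacity on n2 sits unused until either n2 sends a heartbeat or a continuous pass runs, which is the latency problem that continuous scheduling was meant to address. Because both paths take the same lock, they also compete with each other.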

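The FairSharePolicy comparator

FairScheduler sorts queues/apps using the comparator of FairSharePolicy.

Given two Schedulables (queues or apps), the ordering rules are:

1. If one has resource demand and the other does not, the one with demand comes first.

2. If both have demand, compare memory usage against minShare. A Schedulable below its minShare comes before one that is not; if both are below, the one with the smaller usage-to-minShare ratio comes first (that is, try to bring everyone up to minShare first).

3. If still tied, compare the ratio of usage to weight. The smaller ratio comes first, so larger weight and lower usage are favored.

4. If still tied, the earlier submit time comes first, and then the smaller name (the app id, for an app).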

  /**
   * Compare Schedulables mainly via fair share usage to meet fairness.
   * Specifically, it goes through following four steps.
   *
   * 1. Compare demands. Schedulables without resource demand get lower priority
   * than ones who have demands.
   * 
   * 2. Compare min share usage. Schedulables below their min share are compared
   * by how far below it they are as a ratio. For example, if job A has 8 out
   * of a min share of 10 tasks and job B has 50 out of a min share of 100,
   * then job B is scheduled next, because B is at 50% of its min share and A
   * is at 80% of its min share.
   * 
   * 3. Compare fair share usage. Schedulables above their min share are
   * compared by fair share usage by checking (resource usage / weight).
   * If all weights are equal, slots are given to the job with the fewest tasks;
   * otherwise, jobs with more weight get proportionally more slots. If weight
   * equals to 0, we can't compare Schedulables by (resource usage/weight).
   * There are two situations: 1)All weights equal to 0, slots are given
   * to one with less resource usage. 2)Only one of weight equals to 0, slots
   * are given to the one with non-zero weight.
   *
   * 4. Break the tie by compare submit time and job name.
   */
  private static class FairShareComparator implements Comparator<Schedulable>,
      Serializable {
    private static final long serialVersionUID = 5564969375856699313L;

    @Override
    public int compare(Schedulable s1, Schedulable s2) {
      int res = compareDemand(s1, s2);

      // Share resource usages to avoid duplicate calculation
      Resource resourceUsage1 = null;
      Resource resourceUsage2 = null;

      if (res == 0) {
        resourceUsage1 = s1.getResourceUsage();
        resourceUsage2 = s2.getResourceUsage();
        res = compareMinShareUsage(s1, s2, resourceUsage1, resourceUsage2);
      }

      if (res == 0) {
        res = compareFairShareUsage(s1, s2, resourceUsage1, resourceUsage2);
      }

      // Break the tie by submit time
      if (res == 0) {
        res = (int) Math.signum(s1.getStartTime() - s2.getStartTime());
      }

      // Break the tie by job name
      if (res == 0) {
        res = s1.getName().compareTo(s2.getName());
      }

      return res;
    }
  }
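As a worked example of the four steps, the following sketch reimplements the ordering over plain `long` values instead of Hadoop `Resource` objects. It is a simplification under stated assumptions, not the Hadoop code: `SketchSchedulable` and `FairShareOrderSketch` are hypothetical names, resources are one-dimensional, and the helper methods of the real comparator are approximated from its Javadoc. The data reproduces the Javadoc example (A has 8 of a min share of 10, B has 50 of a min share of 100) plus an idle entry C.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, one-dimensional stand-in for a Schedulable. */
class SketchSchedulable {
    final String name;
    final long demand;
    final long usage;
    final long minShare;
    final double weight;   // assumed to be non-negative
    final long startTime;

    SketchSchedulable(String name, long demand, long usage, long minShare,
                      double weight, long startTime) {
        this.name = name;
        this.demand = demand;
        this.usage = usage;
        this.minShare = minShare;
        this.weight = weight;
        this.startTime = startTime;
    }
}

class FairShareOrderSketch implements Comparator<SketchSchedulable> {
    @Override
    public int compare(SketchSchedulable a, SketchSchedulable b) {
        // Step 1: an entry without demand sorts after one with demand.
        if (a.demand == 0 && b.demand > 0) return 1;
        if (b.demand == 0 && a.demand > 0) return -1;

        // Step 2: entries below their min share go first, ordered by usage ratio.
        long min1 = Math.min(a.minShare, a.demand);
        long min2 = Math.min(b.minShare, b.demand);
        boolean needy1 = a.usage < min1;
        boolean needy2 = b.usage < min2;
        if (needy1 && !needy2) return -1;
        if (needy2 && !needy1) return 1;
        if (needy1 && needy2) {
            int byRatio = Double.compare((double) a.usage / Math.max(min1, 1),
                                         (double) b.usage / Math.max(min2, 1));
            if (byRatio != 0) return byRatio;
        }

        // Step 3: compare usage / weight, with the zero-weight cases from the Javadoc.
        int byShare;
        if (a.weight > 0 && b.weight > 0) {
            byShare = Double.compare(a.usage / a.weight, b.usage / b.weight);
        } else if (a.weight == 0 && b.weight == 0) {
            byShare = Long.compare(a.usage, b.usage);
        } else {
            byShare = a.weight > 0 ? -1 : 1; // the non-zero weight goes first
        }
        if (byShare != 0) return byShare;

        // Step 4: break ties by start time, then by name.
        int byTime = Long.compare(a.startTime, b.startTime);
        return byTime != 0 ? byTime : a.name.compareTo(b.name);
    }

    /** Sorts a copy of the input and returns the names in scheduling order. */
    static List<String> orderedNames(List<SketchSchedulable> items) {
        List<SketchSchedulable> sorted = new ArrayList<>(items);
        sorted.sort(new FairShareOrderSketch());
        List<String> names = new ArrayList<>();
        for (SketchSchedulable s : sorted) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        SketchSchedulable a = new SketchSchedulable("A", 20, 8, 10, 1.0, 1);
        SketchSchedulable b = new SketchSchedulable("B", 200, 50, 100, 1.0, 2);
        SketchSchedulable c = new SketchSchedulable("C", 0, 0, 0, 1.0, 0);
        System.out.println(orderedNames(Arrays.asList(a, c, b)));
    }
}
```

B sorts ahead of A because B is at 50% of its min share while A is at 80%, and C sorts last because it has no demand. The sketch uses `Long.compare` for the start-time tie-break, whereas the excerpt above uses `Math.signum` on the difference.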

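Heartbeat scheduling source analysis

Omitted. See the blog post "yarn3.2源码分析之NM与RM通信完成心跳调度" (YARN 3.2 source analysis: heartbeat scheduling through NM-RM communication).

Continuous scheduling source analysis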

ContinuousSchedulingThread

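The thread used for continuous scheduling. Continuous scheduling is disabled by default; the thread is started only when yarn.scheduler.fair.continuous-scheduling-enabled is set to true. Continuous scheduling is now discouraged (the class is marked @Deprecated) because lock contention makes resource scheduling slow. As an alternative, batch assignment can be enabled with yarn.scheduler.fair.assignmultiple.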

  /**
   * Thread which attempts scheduling resources continuously,
   * asynchronous to the node heartbeats.
   */
  @Deprecated
  private class ContinuousSchedulingThread extends Thread {

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          continuousSchedulingAttempt();
          Thread.sleep(getContinuousSchedulingSleepMs());
        } catch (InterruptedException e) {
          LOG.warn("Continuous scheduling thread interrupted. Exiting.", e);
          return;
        }
      }
    }
  }

The continuousSchedulingAttempt() method