YARN 3.2 Source Code Analysis: FairScheduler Preemptive Scheduling

Overview

FairScheduler enables preemptive scheduling when yarn.scheduler.fair.preemption is set to true. The default is false, meaning preemption is disabled.

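When FairScheduler computes how much resource a queue may preempt from other queues, it applies two rules. If the queue's usage has stayed below its min share for longer than the min share preemption timeout, the amount to preempt is the difference between its min share and its current usage. If the queue's usage has stayed below its fair share for longer than fairSharePreemptionTimeout, the amount to preempt is whatever is needed to bring the queue up to its fair share. If both conditions hold, the larger of the two amounts is preempted.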
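The two rules can be sketched in a few lines of Java. This is only an illustration, not code from Hadoop: the class and method names are hypothetical, resources are reduced to a single long, and capping each target at the queue's demand is an assumption borrowed from the starvation methods shown later in this article.

```java
/** Illustrative sketch: resources are modeled as a single long (for example, MB of memory). */
final class PreemptionAmountSketch {

  /**
   * Returns how much a queue may preempt, following the two rules in the text.
   * The *TimedOut flags say whether the queue has been below the respective
   * share for longer than the corresponding preemption timeout.
   */
  static long resourceToPreempt(long usage, long demand, long minShare,
      long fairShare, boolean minShareTimedOut, boolean fairShareTimedOut) {
    long minShareDeficit = 0;
    if (minShareTimedOut) {
      // Assumption: never ask for more than the queue actually demands.
      minShareDeficit = Math.max(0, Math.min(minShare, demand) - usage);
    }
    long fairShareDeficit = 0;
    if (fairShareTimedOut) {
      fairShareDeficit = Math.max(0, Math.min(fairShare, demand) - usage);
    }
    // If both conditions hold, the larger amount wins.
    return Math.max(minShareDeficit, fairShareDeficit);
  }
}
```

With a usage of 2048, a min share of 4096 and a fair share of 6144, the sketch yields 2048 when only the min share timeout has expired and 4096 when both have.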

Preemptive scheduling is carried out mainly by two cooperating threads:

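  • UpdateThread: computes the fairShare of queues and apps, then computes each app's fairshareStarvation and minshareStarvation, and records the apps that are in a starvation state.
  • FSPreemptionThread: processes the starved apps. For a given app, it identifies a batch of containers that can be preempted and kills them when necessary.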

 FSPreemptionThread

When the FairScheduler service starts and preemptive scheduling is enabled, it creates a background thread, FSPreemptionThread.

if (this.conf.getPreemptionEnabled()) {
        createPreemptionThread();
      }

How FSPreemptionThread's run() method works

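  1. The FSStarvedApps field of FSContext, the FairScheduler context, maintains a PriorityBlockingQueue<FSAppAttempt> that holds the apps in a starvation state. FSPreemptionThread takes starved apps from this queue.
  2. For the given app, identifyContainersToPreempt() selects enough containers to satisfy the app's demand and returns the containers that can be preempted.
  3. preemptContainers() schedules the kill: a container marked for preemption that has still not been released once the waitTimeBeforeKill delay has elapsed is killed.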
public void run() {
    while (!Thread.interrupted()) {
      try {
        // The FSStarvedApps field of FSContext (the FairScheduler context) maintains
        // a PriorityBlockingQueue<FSAppAttempt> holding the apps in a starvation state.
        // FSPreemptionThread takes a starved app from it, blocking while it is empty.
        FSAppAttempt starvedApp = context.getStarvedApps().take();
        // Hold the scheduler readlock so this is not concurrent with the
        // update thread.
        schedulerReadLock.lock();
        try {
          // identifyContainersToPreempt(): for the given app, selects enough containers
          // to satisfy its demand and returns the containers that can be preempted.
          // preemptContainers(): schedules a kill for containers that are still not
          // released after the waitTimeBeforeKill delay.
          preemptContainers(identifyContainersToPreempt(starvedApp));
        } finally {
          schedulerReadLock.unlock();
        }
        starvedApp.preemptionTriggered(delayBeforeNextStarvationCheck);
      } catch (InterruptedException e) {
        LOG.info("Preemption thread interrupted! Exiting.");
        Thread.currentThread().interrupt();
      }
    }
  }
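The hand-off between the two threads is a standard blocking priority queue. The following is a minimal sketch of that pattern using only the JDK. The App class is a hypothetical stand-in for FSAppAttempt, and the ordering (most-starved app first) is an assumption of this sketch; the article only states that apps are ordered by total starvation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

final class StarvedAppsSketch {
  /** Hypothetical stand-in for FSAppAttempt: a name plus a total starvation amount. */
  static final class App {
    final String name;
    final long starvation;

    App(String name, long starvation) {
      this.name = name;
      this.starvation = starvation;
    }
  }

  // Assumed ordering for this sketch: the most-starved app is taken first.
  private final PriorityBlockingQueue<App> queue = new PriorityBlockingQueue<>(
      11, Comparator.comparingLong((App a) -> a.starvation).reversed());

  /** Producer side: what the update thread would call for each starved app. */
  void addStarvedApp(App app) {
    queue.add(app);
  }

  /** Consumer side: take() blocks while the queue is empty, like the preemption thread. */
  List<String> takeInOrder(int count) {
    List<String> order = new ArrayList<>();
    try {
      for (int i = 0; i < count; i++) {
        order.add(queue.take().name);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt, as run() does
    }
    return order;
  }
}
```

Because take() blocks, the consumer thread sleeps until the producer records a starved app, so no polling loop is needed.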

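The preemptContainers() method

preemptContainers() schedules a timer task that runs after the waitTimeBeforeKill delay (passed as warnTimeBeforeKill in the code). A container that was marked for preemption and has still not been released voluntarily by its ApplicationMaster by then is killed directly.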

private void preemptContainers(List<RMContainer> containers) {
    // Schedule timer task to kill containers
    preemptionTimer.schedule(
        new PreemptContainersTask(containers), warnTimeBeforeKill);
  }

PreemptContainersTask

Carries out the container kill event.

private class PreemptContainersTask extends TimerTask {
    private final List<RMContainer> containers;

    PreemptContainersTask(List<RMContainer> containers) {
      this.containers = containers;
    }

    @Override
    public void run() {
      for (RMContainer container : containers) {
        ContainerStatus status = SchedulerUtils.createPreemptedContainerStatus(
            container.getContainerId(), SchedulerUtils.PREEMPTED_CONTAINER);

        LOG.info("Killing container " + container);
        scheduler.completedContainer(
            container, status, RMContainerEventType.KILL);
      }
    }
  }
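The delayed kill relies on java.util.Timer. Below is a small self-contained sketch of the same pattern. Containers are plain strings, the live set and the launch/release methods are hypothetical helpers invented for the illustration, and skipping containers that were already released is this sketch's reading of the description above rather than a copy of the Hadoop logic.

```java
import java.util.List;
import java.util.Set;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class DelayedKillSketch {
  private final Set<String> live = ConcurrentHashMap.newKeySet();
  final List<String> killed = new CopyOnWriteArrayList<>();
  // Daemon timer, so it does not keep the JVM alive.
  private final Timer timer = new Timer("Preemption Timer (sketch)", true);

  void launch(String containerId) {
    live.add(containerId);
  }

  /** Models an ApplicationMaster giving a container back voluntarily. */
  void release(String containerId) {
    live.remove(containerId);
  }

  /**
   * Schedules the kill after delayMs, then waits up to timeoutMs for the task.
   * Returns true if the timer task ran within the timeout.
   */
  boolean preemptAndWait(List<String> containerIds, long delayMs, long timeoutMs) {
    CountDownLatch done = new CountDownLatch(1);
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        for (String id : containerIds) {
          // Only containers that are still live are killed.
          if (live.remove(id)) {
            killed.add(id);
          }
        }
        done.countDown();
      }
    }, delayMs);
    try {
      return done.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

The delay gives the ApplicationMaster a window to release the container itself; only what is still running when the timer fires gets killed.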

UpdateThread

AbstractYarnScheduler starts UpdateThread when its service starts. UpdateThread is an inner class of AbstractYarnScheduler.

/**
   * Thread which calls {@link #update()} every
   * <code>updateInterval</code> milliseconds.
   */
  private class UpdateThread extends Thread {
    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          synchronized (updateThreadMonitor) {
            updateThreadMonitor.wait(updateInterval);
          }
          update();
        } catch (InterruptedException ie) {
          LOG.warn("Scheduler UpdateThread interrupted. Exiting.");
          return;
        } catch (Exception e) {
          LOG.error("Exception in scheduler UpdateThread", e);
        }
      }
    }
  }

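FairScheduler#update() method

FairScheduler uses this method to recompute its internal variables, including per-job weights, fairShare, used resources, and per-job resource demand.

  • Recursively compute the resource demand of each queue and its child queues
  • Recursively compute the fairShare of each queue and its child queues/apps
  • Update the minshareStarvation and fairshareStarvation of every FSAppAttempt in the leaf queues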
/**
   * Recompute the internal variables used by the scheduler - per-job weights,
   * fair shares, deficits, minimum slot allocations, and amount of used and
   * required resources per job.
   */
  @VisibleForTesting
  @Override
  public void update() {
    // Storing start time for fsOpDurations
    long start = getClock().getTime();
    FSQueue rootQueue = queueMgr.getRootQueue();

    // Update demands and fairshares
    writeLock.lock();
    try {
      // Recursively update the resource demand of every queue and its child queues
      rootQueue.updateDemand();
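      // Starting from the root queue, recursively compute the fairShare of each
      // queue and its child queues/apps. The root queue's fairShare is the
      // total resource of the cluster.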
      rootQueue.update(getClusterResource());

      // Update metrics
      updateRootQueueMetrics();
    } finally {
      writeLock.unlock();
    }

    readLock.lock();
    try {
      // Update starvation stats and identify starved applications
      if (shouldAttemptPreemption()) {
        for (FSLeafQueue queue : queueMgr.getLeafQueues()) {
          // Update minshareStarvation and fairshareStarvation for every FSAppAttempt in this leaf queue
          queue.updateStarvedApps();
        }
      }

      // Log debug information
      if (STATE_DUMP_LOG.isDebugEnabled()) {
        if (--updatesToSkipForDebug < 0) {
          updatesToSkipForDebug = UPDATE_DEBUG_FREQUENCY;
          dumpSchedulerState();
        }
      }
    } finally {
      readLock.unlock();
    }
    fsOpDurations.addUpdateThreadRunDuration(getClock().getTime() - start);
  }

Recursively computing and updating the fairShare of queues and their child queues/apps, starting from the root queue

FSQueue#update() method

/**
   * Set the queue's fairshare and update the demand/fairshare of child
   * queues/applications.
   *
   * To be called holding the scheduler writelock.
   *
   * @param fairShare
   */
  public void update(Resource fairShare) {
    setFairShare(fairShare);
    updateInternal();
  }

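FSParentQueue#updateInternal() method

Computes the fairShare of all child queues according to the configured scheduling policy. FairScheduler supports three policies: fair, fifo, and drf. The default is fair.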

 void updateInternal() {
    readLock.lock();
    try {
      policy.computeShares(childQueues, getFairShare());
      for (FSQueue childQueue : childQueues) {
        childQueue.getMetrics().setFairShare(childQueue.getFairShare());
        childQueue.updateInternal();
      }
    } finally {
      readLock.unlock();
    }
  }

FSLeafQueue#updateInternal() method

Computes the fairShare of all runnable apps in the leaf queue according to the configured scheduling policy.

@Override
  void updateInternal() {
    readLock.lock();
    try {
      policy.computeShares(runnableApps, getFairShare());
    } finally {
      readLock.unlock();
    }
  }
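The recursion through update() and updateInternal() can be pictured with a toy queue tree. The sketch below is deliberately simplified and is not the Hadoop implementation: QueueTreeSketch is a hypothetical class, shares are plain doubles, and each parent splits its share among its children purely in proportion to weight, ignoring minShare and maxShare.

```java
import java.util.ArrayList;
import java.util.List;

final class QueueTreeSketch {
  final String name;
  final double weight;
  final List<QueueTreeSketch> children = new ArrayList<>();
  double fairShare;

  QueueTreeSketch(String name, double weight) {
    this.name = name;
    this.weight = weight;
  }

  QueueTreeSketch add(QueueTreeSketch child) {
    children.add(child);
    return this;
  }

  /** Sets this node's share, then pushes shares down to the children recursively. */
  void update(double share) {
    fairShare = share;
    double totalWeight = 0;
    for (QueueTreeSketch child : children) {
      totalWeight += child.weight;
    }
    for (QueueTreeSketch child : children) {
      // Simplification: weight-proportional split only, no min/max clamping.
      double childShare = totalWeight == 0 ? 0 : share * child.weight / totalWeight;
      child.update(childShare);
    }
  }
}
```

Calling update(8192) on a root with children a (weight 1) and b (weight 3), where b has two equally weighted children, gives a 2048, b 6144, and each child of b 3072.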

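How FairSharePolicy computes fairShare

Every Schedulable object has three attributes, minShare, maxShare, and fairShare, plus a weight. For a queue, minShare and maxShare are the minResources and maxResources configured in fair-scheduler.xml, and the weight is also configured directly. For an FSAppAttempt, minShare returns 0 and maxShare returns Long.MAX_VALUE. Unless yarn.scheduler.fair.sizebasedweight=true is configured, the weight returns 1.0, which means all apps have the same weight.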

 @Override
  public void computeShares(Collection<? extends Schedulable> schedulables,
      Resource totalResources) {
    // FairSharePolicy considers only memory, so the resource type passed here is MEMORY
    ComputeFairShares.computeShares(schedulables, totalResources, MEMORY);
  }

computeShares() method

/**
   * Compute fair share of the given schedulables.Fair share is an allocation of
   * shares considering only active schedulables ie schedulables which have
   * running apps.
   * 
   * @param schedulables
   * @param totalResources
   * @param type
   */
  public static void computeShares(
      Collection<? extends Schedulable> schedulables, Resource totalResources,
      String type) {
    // computeShares() is a thin wrapper around computeSharesInternal():
    // it passes false for the last parameter, isSteadyShare.
    computeSharesInternal(schedulables, totalResources, type, false);
  }

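computeSharesInternal() method

Assume each Schedulable's minShare and maxShare are given in advance. From each Schedulable's minShare, maxShare, and weight, together with the total available resources totalResources (totalSlots), the method computes a weighted fairShare for every Schedulable. The computed fairShare must be neither below the Schedulable's minShare nor above its maxShare, so the strategy is as follows:

  • If a Schedulable's minShare > weight*R, its fairShare = minShare
  • If a Schedulable's maxShare < weight*R, its fairShare = maxShare
  • For every other Schedulable, fairShare = weight*R
  • The fairShares of all Schedulables sum to the total available resources totalResources

R is called the weight-to-slots ratio because it converts a Schedulable's weight into the amount of resource the Schedulable should be allocated.

A binary search finds the R at which the fairShares use up the available resources. At R = 0 every Schedulable gets only its minShare. To find an upper bound for the search, R starts at 1 and is doubled until the sum of the Schedulables' fairShares is >= totalResource. The binary search then runs between 0 and that upper bound. resourceUsedWithWeightToResourceRatio() computes the sum of the Schedulables' fairShares for a given R.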

 private static void computeSharesInternal(
      Collection<? extends Schedulable> allSchedulables,
      Resource totalResources, String type, boolean isSteadyShare) {

    Collection<Schedulable> schedulables = new ArrayList<>();
    // Sum up the fixed fairShares in allSchedulables.
    // A Schedulable whose maxShare is 0 or whose weight is 0 has a fixed share:
    //   maxShare == 0: the fixed fairShare is 0
    //   weight == 0:   the fixed fairShare is its minShare
    int takenResources = handleFixedFairShares(
        allSchedulables, schedulables, isSteadyShare, type);

    if (schedulables.isEmpty()) {
      return;
    }
    // Find an upper bound on R that we can use in our binary search. We start
    // at R = 1 and double it until we have either used all the resources or we
    // have met all Schedulables' max shares.
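    // Sum the maxShare of the remaining schedulables (capped at Integer.MAX_VALUE).
    // No more than this sum can ever be handed out, so it bounds the target that
    // the doubling of R below has to reach.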
    int totalMaxShare = 0;
    for (Schedulable sched : schedulables) {
      long maxShare = sched.getMaxShare().getResourceValue(type);
      totalMaxShare = (int) Math.min(maxShare + (long)totalMaxShare,
          Integer.MAX_VALUE);
      if (totalMaxShare == Integer.MAX_VALUE) {
        break;
      }
    }

    long totalResource = Math.max((totalResources.getResourceValue(type) -
        takenResources), 0);
    // Cap totalResource at totalMaxShare
    totalResource = Math.min(totalMaxShare, totalResource);

    // Start with R = 1
    double rMax = 1.0;
    // Double R until the sum of the Schedulables' fairShares is >= totalResource.
    // resourceUsedWithWeightToResourceRatio() computes each Schedulable's fairShare
    // for the given R and sums them:
    //   fairShare = weight*R  if minShare <= weight*R <= maxShare
    //   fairShare = minShare  if weight*R < minShare
    //   fairShare = maxShare  if maxShare < weight*R
    while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
        < totalResource) {
      rMax *= 2.0;
    }
    // Perform the binary search for up to COMPUTE_FAIR_SHARES_ITERATIONS steps
    // Binary search to narrow R down so that the sum of the fairShares equals totalResource
    double left = 0;
    double right = rMax;
    for (int i = 0; i < COMPUTE_FAIR_SHARES_ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      int plannedResourceUsed = resourceUsedWithWeightToResourceRatio(
          mid, schedulables, type);
      if (plannedResourceUsed == totalResource) {
        right = mid;
        break;
      } else if (plannedResourceUsed < totalResource) {
        left = mid;
      } else {
        right = mid;
      }
    }
    // Set the fair shares based on the value of R we've converged to
    for (Schedulable sched : schedulables) {
      Resource target;

      if (isSteadyShare) {
        target = ((FSQueue) sched).getSteadyFairShare();
      } else {
        // Get the Schedulable's fairShare object
        target = sched.getFairShare();
      }
      // Using the final R, set each Schedulable's fairShare to weight*R clamped to [minShare, maxShare]
      target.setResourceValue(type, (long)computeShare(sched, right, type));
    }
  }
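To make the doubling and binary search concrete, here is a stripped-down sketch of the same idea on plain arrays. It is an illustration under stated assumptions, not the Hadoop code: the class name, the array-based signature and the iteration cap of 25 are all hypothetical, every weight is assumed to be positive (the fixed-share handling above is omitted), and resource totals are assumed small enough that the long sums do not overflow.

```java
final class FairShareSketch {
  static final int ITERATIONS = 25; // assumed cap on binary search steps

  /** weight*R clamped to [min, max]. */
  static double share(double weight, long min, long max, double r) {
    return Math.min(max, Math.max(min, weight * r));
  }

  /** Sum of the shares that a given R would hand out. */
  static long used(double r, double[] weight, long[] min, long[] max) {
    long sum = 0;
    for (int i = 0; i < weight.length; i++) {
      sum += (long) share(weight[i], min[i], max[i], r);
    }
    return sum;
  }

  static long[] computeShares(double[] weight, long[] min, long[] max, long total) {
    // The target cannot exceed the sum of the max shares (saturating add).
    long totalMax = 0;
    for (long m : max) {
      totalMax = (m > Long.MAX_VALUE - totalMax) ? Long.MAX_VALUE : totalMax + m;
    }
    long target = Math.min(total, totalMax);

    // Find an upper bound on R by doubling, starting from 1.
    double rMax = 1.0;
    while (used(rMax, weight, min, max) < target) {
      rMax *= 2.0;
    }

    // Binary search between 0 and the upper bound.
    double left = 0;
    double right = rMax;
    for (int i = 0; i < ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      long planned = used(mid, weight, min, max);
      if (planned == target) {
        right = mid;
        break;
      } else if (planned < target) {
        left = mid;
      } else {
        right = mid;
      }
    }

    long[] result = new long[weight.length];
    for (int i = 0; i < weight.length; i++) {
      result[i] = (long) share(weight[i], min[i], max[i], right);
    }
    return result;
  }
}
```

For weights {1, 1, 2}, min shares {0, 3000, 0}, max shares {Long.MAX_VALUE, Long.MAX_VALUE, 1000} and a total of 8000, the search settles on R = 3500: the third entry is held at its max share of 1000 and the first two receive 3500 each.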

 

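Computing and updating the minshareStarvation and fairshareStarvation of every FSAppAttempt in a leaf queue

FSLeafQueue#updateStarvedApps() method

A queue can be starved in two ways: minshare starvation and fairshare starvation.

The size of minshare is determined by the queue alone.

The size of fairshare is determined by both the queue and the app.

If the queue is in minshare starvation, the apps most in need of resources have to be picked.

If the queue is in fairshare starvation, at least one app is starved.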

  /**
   * Helper method to identify starved applications. This needs to be called
   * ONLY from {@link #updateInternal}, after the application shares
   * are updated.
   *
   * Caller does not need read/write lock on the leaf queue.
   */
  void updateStarvedApps() {
    // Fetch apps with pending demand
    TreeSet<FSAppAttempt> appsWithDemand = fetchAppsWithDemand(false);

    // Process apps with fairshare starvation
    // Check whether each app's usage has been below its fair share for longer
    // than fairSharePreemptionTimeout. Record the apps in fairshare starvation
    // and return the sum of their fairshare shortfalls.
    Resource fairShareStarvation = updateStarvedAppsFairshare(appsWithDemand);

    // Compute extent of minshare starvation
    // Compute the queue's minShareStarvation
    Resource minShareStarvation = minShareStarvation();

    // Compute minshare starvation that is not subsumed by fairshare starvation
    // minShareStarvation = minShareStarvation - fairShareStarvation (floored at zero),
    // i.e. the minshare starvation not already covered by fairshare starvation
    Resources.subtractFromNonNegative(minShareStarvation, fairShareStarvation);

    // Assign this minshare to apps with pending demand over fairshare
    // Compute each app's minshareStarvation
    updateStarvedAppsMinshare(appsWithDemand, minShareStarvation);
  }

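FSLeafQueue#updateStarvedAppsMinshare() method

Distributes the queue's minshareStarvation among the apps with pending demand. For each app, the candidate amount is the app's pendingDemand minus its fairshareStarvation. If the candidate is no larger than what remains of the queue's minshareStarvation, it is recorded as the app's minshareStarvation and deducted from the remainder. Otherwise the candidate minus the remainder is recorded and the remainder drops to zero. Once the remainder is zero, the minshareStarvation of the remaining apps is reset.

Apps in minshare starvation are recorded in the FSStarvedApps field of FSContext, the FairScheduler context. FSStarvedApps maintains a PriorityBlockingQueue<FSAppAttempt> internally, which orders apps by their total starvation (fairshareStarvation + minshareStarvation). This queue is eventually processed by FSPreemptionThread.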

/**
   * Distribute minshare starvation to a set of apps
   * @param appsWithDemand set of apps
   * @param minShareStarvation minshare starvation to distribute
   */
  private void updateStarvedAppsMinshare(
      final TreeSet<FSAppAttempt> appsWithDemand,
      final Resource minShareStarvation) {
//pending = minShareStarvation
    Resource pending = Resources.clone(minShareStarvation);

    // Keep adding apps to the starved list until the unmet demand goes over
    // the remaining minshare
    for (FSAppAttempt app : appsWithDemand) {
      if (!Resources.isNone(pending)) {
        Resource appMinShare = app.getPendingDemand();
//appMinShare = appMinShare - app.getFairshareStarvation()
        Resources.subtractFromNonNegative(
            appMinShare, app.getFairshareStarvation());

        if (Resources.greaterThan(policy.getResourceCalculator(),
            scheduler.getClusterResource(), appMinShare, pending)) {
//appMinShare = appMinShare - pending
          Resources.subtractFromNonNegative(appMinShare, pending);
          pending = none();
        } else {
//pending = pending - appMinShare
          Resources.subtractFromNonNegative(pending, appMinShare);
        }
        // Record appMinShare as the app's minshareStarvation
        app.setMinshareStarvation(appMinShare);
        // Add the app in minshare starvation to the FSStarvedApps of FSContext
        context.getStarvedApps().addStarvedApp(app);
      } else {
        // Reset minshare starvation in case we had set it in a previous
        // iteration
        app.resetMinshareStarvation();
      }
    }
  }
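A worked example helps when tracing the two branches. The sketch below mirrors the arithmetic of the listing above on plain longs. The class name and array-based signature are hypothetical, and the single-dimension comparison stands in for the policy's resource calculator.

```java
final class MinshareDistributionSketch {
  /**
   * @param pendingDemand per-app pending demand, in the order the apps are visited
   * @param fairshareStarvation per-app fairshare starvation
   * @param queueMinshareStarvation minshare starvation of the queue to distribute
   * @return the minshare starvation recorded for each app
   */
  static long[] distribute(long[] pendingDemand, long[] fairshareStarvation,
      long queueMinshareStarvation) {
    long pending = queueMinshareStarvation;
    long[] result = new long[pendingDemand.length];
    for (int i = 0; i < pendingDemand.length; i++) {
      if (pending > 0) {
        // appMinShare = pendingDemand - fairshareStarvation, floored at zero
        long appMinShare = Math.max(0, pendingDemand[i] - fairshareStarvation[i]);
        if (appMinShare > pending) {
          // Same step as the listing: subtract the remainder, then clear it
          appMinShare = appMinShare - pending;
          pending = 0;
        } else {
          pending -= appMinShare;
        }
        result[i] = appMinShare;
      } else {
        // Corresponds to resetMinshareStarvation() in the listing
        result[i] = 0;
      }
    }
    return result;
  }
}
```

With pending demands {300, 500, 200}, fairshare starvations {100, 0, 0} and a queue minshare starvation of 600, the first app records 200 (leaving 400), the second records 500 - 400 = 100 and exhausts the remainder, and the third is reset to 0.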

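FSLeafQueue#minShareStarvation() method

Computes the queue's minshareStarvation. It checks whether the queue's usage has been below its minshare for longer than MinSharePreemptionTimeout and, if so, returns the shortfall relative to minshare (or relative to demand when demand is below minshare). Otherwise it returns none.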

/**
   * Helper method to compute the amount of minshare starvation.
   *
   * @return the extent of minshare starvation
   */
  private Resource minShareStarvation() {
    // If demand < minshare, we should use demand to determine starvation
    // starvation = min(minShare, demand)
    Resource starvation =
        Resources.componentwiseMin(getMinShare(), getDemand());
    // starvation = starvation - getResourceUsage()
    // i.e. the shortfall relative to minshare, or relative to demand
    Resources.subtractFromNonNegative(starvation, getResourceUsage());

    // The queue is in minshare starvation exactly when the starvation amount is not none
    boolean starved = !Resources.isNone(starvation);
    long now = scheduler.getClock().getTime();

    if (!starved) {
      // Record that the queue is not starved
      // Not in minshare starvation: record now as the last time the queue was at its min share
      setLastTimeAtMinShare(now);
    }

    if (now - lastTimeAtMinShare < getMinSharePreemptionTimeout()) {
      // the queue is not starved for the preemption timeout
      // Even if the starvation amount is not none, reset it to none while the timeout has not elapsed
      starvation = Resources.clone(Resources.none());
    }

    return starvation;
  }
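The interplay between the starvation amount and the timeout is easiest to follow with explicit timestamps. The following sketch reproduces the same logic on plain longs. The class name is hypothetical, the clock is passed in as a parameter instead of being read from the scheduler, and lastTimeAtMinShare is initialized to the creation time as an assumption.

```java
final class MinShareStarvationSketch {
  private final long minShare;
  private final long timeoutMs;
  private long lastTimeAtMinShare;

  MinShareStarvationSketch(long minShare, long timeoutMs, long now) {
    this.minShare = minShare;
    this.timeoutMs = timeoutMs;
    this.lastTimeAtMinShare = now; // assumption: the queue starts out satisfied
  }

  long starvation(long usage, long demand, long now) {
    // If demand < minShare, demand determines the starvation
    long starvation = Math.max(0, Math.min(minShare, demand) - usage);
    if (starvation == 0) {
      // Not starved: remember when the queue was last at its min share
      lastTimeAtMinShare = now;
    }
    if (now - lastTimeAtMinShare < timeoutMs) {
      // Not starved for long enough yet
      starvation = 0;
    }
    return starvation;
  }
}
```

With a min share of 4096 and a timeout of 1000 ms, a queue using 1024 reports no starvation at t = 500 but reports 3072 at t = 1000. Any moment at which the queue is satisfied restarts the timeout.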

 

FSLeafQueue#updateStarvedAppsFairshare() method

Computes, for a set of apps, how far each app falls short of its fair share, and records the apps in fairshare starvation in the FSStarvedApps field of FSContext, the FairScheduler context. Finally, it returns the sum of these shortfalls.

FSStarvedApps keeps these apps in a PriorityBlockingQueue<FSAppAttempt> ordered by total starvation (fairshareStarvation + minshareStarvation), which FSPreemptionThread eventually processes.

/**
   * Compute the extent of fairshare starvation for a set of apps.
   *
   * @param appsWithDemand apps to compute fairshare starvation for
   * @return aggregate fairshare starvation for all apps
   */
  private Resource updateStarvedAppsFairshare(
      TreeSet<FSAppAttempt> appsWithDemand) {
    Resource fairShareStarvation = Resources.clone(none());
    // Fetch apps with unmet demand sorted by fairshare starvation
    for (FSAppAttempt app : appsWithDemand) {
      // Compute how far this app falls short of its fairShare
      Resource appStarvation = app.fairShareStarvation();
      if (!Resources.isNone(appStarvation))  {
        // Record the starved app in the FSStarvedApps of FSContext
        context.getStarvedApps().addStarvedApp(app);
        // Accumulate the app's shortfall
        Resources.addTo(fairShareStarvation, appStarvation);
      } else {
        break;
      }
    }
    // Return the sum of the shortfalls
    return fairShareStarvation;
  }

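FSAppAttempt#fairShareStarvation() method

Checks whether the app's usage has been below its fair share (scaled by the queue's fairSharePreemptionThreshold and capped at the app's demand) for longer than fairSharePreemptionTimeout. If so, the app's fairshareStarvation = fairDemand - getResourceUsage(). Otherwise it is none.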

/**
   * Helper method that computes the extent of fairshare starvation.
   * @return freshly computed fairshare starvation
   */
  Resource fairShareStarvation() {
    long now = scheduler.getClock().getTime();
    Resource threshold = Resources.multiply(
        getFairShare(), getQueue().getFairSharePreemptionThreshold());
    Resource fairDemand = Resources.componentwiseMin(threshold, demand);

    // Check if the queue is starved for fairshare
    // Check whether the app's usage is below fairDemand
    boolean starved = isUsageBelowShare(getResourceUsage(), fairDemand);

    if (!starved) {
      lastTimeAtFairShare = now;
    }

    if (!starved ||
        now - lastTimeAtFairShare <
            getQueue().getFairSharePreemptionTimeout()) {
      fairshareStarvation = Resources.none();
    } else {
      // The app has been starved for longer than preemption-timeout:
      // fairshareStarvation = fairDemand - getResourceUsage()
      fairshareStarvation =
          Resources.subtractFromNonNegative(fairDemand, getResourceUsage());
    }
    return fairshareStarvation;
  }
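The effect of the threshold and the timeout can be checked with a small numeric sketch. As before, this is an illustration rather than the Hadoop code: the class name is hypothetical, resources are single longs, the clock is passed in, and "usage below share" is simplified to a plain less-than comparison in place of isUsageBelowShare().

```java
final class FairShareStarvationSketch {
  private final long fairShare;
  private final double threshold; // plays the role of fairSharePreemptionThreshold
  private final long timeoutMs;   // plays the role of fairSharePreemptionTimeout
  private long lastTimeAtFairShare;

  FairShareStarvationSketch(long fairShare, double threshold, long timeoutMs, long now) {
    this.fairShare = fairShare;
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.lastTimeAtFairShare = now; // assumption: the app starts out satisfied
  }

  long starvation(long usage, long demand, long now) {
    // fairDemand = min(fairShare * threshold, demand)
    long fairDemand = Math.min((long) (fairShare * threshold), demand);
    boolean starved = usage < fairDemand;
    if (!starved) {
      lastTimeAtFairShare = now;
    }
    if (!starved || now - lastTimeAtFairShare < timeoutMs) {
      return 0;
    }
    // Starved for longer than the timeout
    return Math.max(0, fairDemand - usage);
  }
}
```

With a fair share of 4000, a threshold of 0.5 and a timeout of 1000 ms, an app using 1000 with a demand of 5000 reports nothing at t = 500 and a starvation of 1000 at t = 1500, because fairDemand is 2000 rather than the full fair share.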

 

 

 
