Understanding YARN Global Scheduling in Depth, with Source Code Analysis

凡哲_Lucas

Category: Yarn

Article link: https://blog.csdn.net/weixin_35792948/article/details/106360963

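This article takes a close look at global scheduling in YARN. It contrasts global scheduling with the original heartbeat-driven mode, explains how global scheduling turns the tables and makes scheduling decisions concurrently, and walks through the Capacity Scheduler source code for how cluster resources are produced and consumed, showing how global scheduling can raise cluster throughput.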

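Background
The original heartbeat scheduling mode
The global scheduling mode
- Global scheduling turns the tables
- Concurrent decisions in global scheduling
Source code analysis of global scheduling in the Capacity Scheduler
- Source code analysis: producing cluster resources
- Source code analysis: consuming cluster resources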

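Background

I previously wrote an article introducing the idea behind global scheduling, "YARN global scheduling (an analysis of the global scheduling idea)". This article analyzes YARN's implementation of global scheduling in depth and walks through the source code. Given the trend toward hybrid, heterogeneous compute clusters and toward separating storage from compute, global scheduling will be a very important feature.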

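The original heartbeat scheduling mode

[Figure]
As the figure above shows, the selection process walks from the root node of the queue tree down to its child nodes, then to an app, and then, within that app, to its highest-priority Container request. The scheduler then checks whether the resource producer, that is, the Node whose heartbeat just arrived, can satisfy this Container (the consumer). If it can, the resources are given to that Container.
Notice that the step in which the scheduling algorithm chooses which Container to schedule happens before the heartbeat is checked for allocation, and that this step is decoupled from the resource producer. This is exactly where global scheduling comes in.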
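
The two steps described above (choose a request first, check the heartbeating node second) can be sketched as follows. This is a deliberately simplified illustration, not YARN code: `HeartbeatSchedulingSketch`, `Request`, and both methods are hypothetical names, the queue and app levels are collapsed into a single priority sort, and memory is the only resource considered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model; none of these types are real YARN classes.
final class HeartbeatSchedulingSketch {

  /** A pending container request; a smaller priority value is more urgent (assumption). */
  static final class Request {
    final String app;
    final int priority;
    final int memoryMb;

    Request(String app, int priority, int memoryMb) {
      this.app = app;
      this.priority = priority;
      this.memoryMb = memoryMb;
    }
  }

  /** Step 1: pick the next request without looking at any node. */
  static Optional<Request> pickNextRequest(List<Request> pending) {
    return pending.stream().min(Comparator.comparingInt((Request r) -> r.priority));
  }

  /**
   * Step 2: a node heartbeat arrives with its free memory; only now is the
   * chosen request checked against the node. Returns the allocated request,
   * or empty if the chosen request does not fit on this node.
   */
  static Optional<Request> onHeartbeat(int nodeFreeMemoryMb, List<Request> pending) {
    Optional<Request> chosen = pickNextRequest(pending);
    if (chosen.isPresent() && chosen.get().memoryMb <= nodeFreeMemoryMb) {
      pending.remove(chosen.get());
      return chosen;
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<Request> pending = new ArrayList<>(Arrays.asList(
        new Request("app-1", 1, 8192),
        new Request("app-2", 2, 1024)));
    // A small node heartbeats: the top request does not fit, nothing is allocated.
    System.out.println(onHeartbeat(2048, pending).isPresent());
    // A large node heartbeats: the top request fits and is allocated.
    System.out.println(onHeartbeat(16384, pending).map(r -> r.app).orElse("none"));
  }
}
```

In this toy model the request is chosen before the node is known, so a heartbeat from a node that cannot host it is simply wasted for that request.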

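The global scheduling mode

Global scheduling turns the tables

[Figure]
The original mode is heartbeat-driven: the Node whose heartbeat arrives is the only producer, and a heartbeat is the act of producing resources, so the consumer has a very narrow range to choose from. Global scheduling reverses this. The consumer decides up front which resources it wants and then, following its own order of preference, tries to get the resources it wants most.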

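An example:

Original model: you want to buy fruit, and the fruit shops are the heartbeating nodes. Whichever shop has fruit in stock tells you to come and pick it up. You get fruit every time, but if bananas are the only thing you want, it may take until the 100th notification before a shop offers bananas.

Global scheduling model: you want bananas, so you keep asking the shops most likely to have bananas. You are in charge now and no longer wait for the shops to notify you.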
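
The banana example can be turned into a small sketch in which the consumer ranks all candidate nodes by its own preference instead of waiting for one node to report in. Everything here is an assumption made for illustration: the `Node` and `Request` types, the node/rack preference fields, and the three-level score are hypothetical and are not taken from the YARN source.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model; none of these types are real YARN classes.
final class GlobalPlacementSketch {

  static final class Node {
    final String name;
    final String rack;
    final int freeMemoryMb;

    Node(String name, String rack, int freeMemoryMb) {
      this.name = name;
      this.rack = rack;
      this.freeMemoryMb = freeMemoryMb;
    }
  }

  static final class Request {
    final int memoryMb;
    final String preferredNode;
    final String preferredRack;

    Request(int memoryMb, String preferredNode, String preferredRack) {
      this.memoryMb = memoryMb;
      this.preferredNode = preferredNode;
      this.preferredRack = preferredRack;
    }
  }

  /** Lower score is better: 0 = preferred node, 1 = preferred rack, 2 = anywhere else. */
  static int score(Request r, Node n) {
    if (n.name.equals(r.preferredNode)) {
      return 0;
    }
    if (n.rack.equals(r.preferredRack)) {
      return 1;
    }
    return 2;
  }

  /** The consumer looks at every node that can host it and takes the one it likes best. */
  static Optional<Node> place(Request r, List<Node> nodes) {
    return nodes.stream()
        .filter(n -> n.freeMemoryMb >= r.memoryMb)
        .min(Comparator.comparingInt((Node n) -> score(r, n)));
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("n1", "rack-1", 512),
        new Node("n2", "rack-1", 4096),
        new Node("n3", "rack-2", 8192));
    // n1 is preferred but too small, so the same-rack node n2 wins over n3.
    Request r = new Request(1024, "n1", "rack-1");
    System.out.println(place(r, nodes).map(n -> n.name).orElse("none"));
  }
}
```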

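Concurrent decisions in global scheduling

[Figure]
The scheduling decision model in YARN is itself single-threaded: from the root node to the child nodes, then to an app, and finally to the highest-priority container within that app.

As the analysis above shows, this whole decision logic is decoupled from resource production on the nodes. Global scheduling makes this part concurrent: as in the figure above, several decision passes run at the same time, each yielding a consumer's resource demand. A separate, single thread is then split off to judge whether these demands are valid and whether a producer can satisfy them. If so, the consumption takes place.

Concurrent decisions speed up consumption by the resource consumers. When the resources produced by the producers start to pile up, global scheduling in theory raises cluster throughput.

In practice, parameters such as the thread pool size should be tuned to the workload in order to reach the maximum scheduling throughput.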
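
A rough sketch of this "many proposers, one committer" split is shown below. It is an illustration under stated assumptions rather than the Capacity Scheduler implementation: `ProposeCommitSketch`, `Proposal`, and `run` are hypothetical names, every proposer targets the same single node, and the commit step runs only after all proposers have finished so that the result is deterministic (in a real scheduler the two would overlap).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model; none of these types are real YARN classes.
final class ProposeCommitSketch {

  /** An allocation proposal produced by one of the concurrent decision threads. */
  static final class Proposal {
    final String app;
    final int memoryMb;

    Proposal(String app, int memoryMb) {
      this.app = app;
      this.memoryMb = memoryMb;
    }
  }

  /**
   * Starts `proposers` threads that each propose one container on the same node,
   * then validates the proposals one by one. Returns how many were accepted.
   */
  static int run(int proposers, int memoryPerContainerMb, int nodeFreeMb) {
    BlockingQueue<Proposal> backlog = new LinkedBlockingQueue<>();
    ExecutorService pool = Executors.newFixedThreadPool(proposers);
    for (int i = 0; i < proposers; i++) {
      final String app = "app-" + i;
      // Proposers do not coordinate, so together they may over-subscribe the node.
      pool.execute(() -> backlog.add(new Proposal(app, memoryPerContainerMb)));
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("proposers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for proposers", e);
    }

    // Serial commit step: the only place where the node's free memory changes.
    int free = nodeFreeMb;
    int accepted = 0;
    Proposal p;
    while ((p = backlog.poll()) != null) {
      if (p.memoryMb <= free) {
        free -= p.memoryMb;
        accepted++;
      }
      // Otherwise the proposal is rejected; the request would be retried later.
    }
    return accepted;
  }

  public static void main(String[] args) {
    // 8 concurrent proposals of 1024 MB against a node with 3072 MB free.
    System.out.println(run(8, 1024, 3072));
  }
}
```

Keeping the commit step serial means the decision threads need no locking among themselves; the price is that some proposals are rejected and have to be retried.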

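Source code analysis of global scheduling in the Capacity Scheduler

Source code analysis: producing cluster resources

In the CapacityScheduler class, a pool of asynchronous scheduling threads (AsyncScheduleThread) runs the decision logic concurrently and produces the Containers that can be allocated.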

  private void startSchedulerThreads() {
    writeLock.lock();
    try {
      activitiesManager.start();
      if (scheduleAsynchronously) {
        Preconditions.checkNotNull(asyncSchedulerThreads,
            "asyncSchedulerThreads is null");
        // Start every asynchronous scheduling thread.
        for (Thread t : asyncSchedulerThreads) {
          t.start();
        }

        // Start the resource committer service alongside them.
        resourceCommitterService.start();
      }
    } finally {
      writeLock.unlock();
    }
  }

The Container allocation decision logic of a single thread:

  static class AsyncScheduleThread extends Thread {

    private final CapacityScheduler cs;
    private AtomicBoolean runSchedules = new AtomicBoolean(false);

    public AsyncScheduleThread(CapacityScheduler cs) {
      this.cs = cs;
      setDaemon(true);
    }

    @Override
    public void run() {
      int debuggingLogCounter = 0;
      while (!Thread.currentThread().isInterrupted()) {
        try {
          if (!runSchedules.get()) {
            Thread.sleep(100);
          } else {
            // Don't run schedule if we have some pending backlogs already
            if (cs.getAsyncSchedulingPendingBacklogs()
                > cs.asyncMaxPendingBacklogs) {
              Thread.sleep(1);
            } else {
              schedule(cs);
              if (LOG.isDebugEnabled()) {
                // Adding a debug log here to ensure that the thread is alive
                // and running fine.
                if (debuggingLogCounter++ > 10000) {
                  debuggingLogCounter = 0;
                  LOG.debug("AsyncScheduleThread[" + getName() + "] is running!");
                }
              }
            }
          }
        } catch (InterruptedException ie) {
          // keep interrupt signal
          Thread.currentThread().interrupt();
        }
      }
      LOG.info("AsyncScheduleThread[" + getName() + "] exited!");
    }
    // (the rest of the class is not shown in this excerpt)
  }

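While scheduling is suspended (runSchedules is false), the thread sleeps 100 ms between checks. Once it is running, it performs a scheduling pass as long as the pending backlog does not exceed the configured limit (asyncMaxPendingBacklogs). When the backlog exceeds the limit, it sleeps 1 ms and checks again.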
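
The branching in the run() loop above can be isolated into a small pure function, which makes the three outcomes easy to see and to test. This is only a restatement of the excerpt under assumed names: `AsyncLoopDecisionSketch`, `Action`, and `next` are hypothetical and do not exist in the YARN source.

```java
// Hypothetical restatement of the branching in AsyncScheduleThread.run().
final class AsyncLoopDecisionSketch {

  /** What the loop does on one iteration. */
  enum Action {
    SLEEP_100_MS, // scheduling is suspended
    SLEEP_1_MS,   // too many uncommitted proposals are already pending
    SCHEDULE      // run one scheduling pass
  }

  static Action next(boolean runSchedules, int pendingBacklogs, int maxPendingBacklogs) {
    if (!runSchedules) {
      return Action.SLEEP_100_MS;
    }
    if (pendingBacklogs > maxPendingBacklogs) {
      return Action.SLEEP_1_MS;
    }
    return Action.SCHEDULE;
  }

  public static void main(String[] args) {
    // The limit of 100 used here is an arbitrary value chosen for the demo.
    System.out.println(next(false, 0, 100));
    System.out.println(next(true, 101, 100));
    System.out.println(next(true, 100, 100));
  }
}
```

Note that the comparison is strict: a backlog exactly equal to the limit still allows a scheduling pass.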

  /**
   * Schedule on all nodes by starting at a random point.
   * @param cs
   */
  static void schedule(CapacityScheduler cs) throws InterruptedException {
    // First randomize the start point
    int current = 0;
    Collection<FiCaSchedulerNode> nodes = cs.nodeTracker.getAllNodes();

    // If nodes size is 0 (when there are no node managers registered),
    // we can return from here itself.
    int nodeSize = nodes.size();
    if (nodeSize == 0) {
      return;
    }
    int start = random.nextInt(nodeSize);

    // To avoid too verbose DEBUG logging, only print debug log once for
    // every 10 secs.
    boolean printSkipedNodeLogging = false;
    if (Time.monotonicNow() / 1000 % 10 == 0) {
      printSkipedNodeLogging = (!printedVerboseLoggingForAsyncScheduling);
    } else {
      printedVerboseLoggingForAsyncScheduling = false;
    }

    // Allocate containers of node [start, end)
    for (FiCaSchedulerNode node : nodes) {
      if (current++ >= start) {
        if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
          continue;
        }
        cs.allocateContainersToNode(node.getNodeID(), false);
      }
    }

    current = 0;

    // Allocate containers of node [0, start)
    for (FiCaSchedulerNode node : nodes
