CapacityScheduler -- ApplicationMaster resource allocation (based on Hadoop 2.7.6)
Resource allocation is passive: scheduling happens when a node sends a heartbeat (NODE_UPDATE), based on the resources that node reports.
First, a note: the resources an ApplicationMaster needs to start (memory and virtual cores) are initialized on the client side when the application is submitted (in the YARNRunner class); memory defaults to 1536 MB and virtual cores to 1.
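These defaults come from the MapReduce configuration (`yarn.app.mapreduce.am.resource.mb`, default 1536, and `yarn.app.mapreduce.am.resource.cpu-vcores`, default 1), resolved with the usual lookup-with-default pattern. A minimal sketch of that pattern, with a plain `Map` standing in for Hadoop's `Configuration` (class and method names here are illustrative, not Hadoop's):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: how YARNRunner-style AM resource defaults resolve. The real code
// reads these keys from a Hadoop Configuration object; a Map stands in here.
public class AmResourceDefaults {
    public static final String AM_MEMORY_KEY = "yarn.app.mapreduce.am.resource.mb";
    public static final String AM_VCORES_KEY = "yarn.app.mapreduce.am.resource.cpu-vcores";
    public static final int DEFAULT_AM_MEMORY_MB = 1536;
    public static final int DEFAULT_AM_VCORES = 1;

    // Return the configured value, or the default when the key is unset.
    static int getInt(Map<String, String> conf, String key, int dflt) {
        String v = conf.get(key);
        return v == null ? dflt : Integer.parseInt(v);
    }

    public static int amMemoryMb(Map<String, String> conf) {
        return getInt(conf, AM_MEMORY_KEY, DEFAULT_AM_MEMORY_MB);
    }

    public static int amVcores(Map<String, String> conf) {
        return getInt(conf, AM_VCORES_KEY, DEFAULT_AM_VCORES);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(amMemoryMb(conf) + " " + amVcores(conf)); // 1536 1
        conf.put(AM_MEMORY_KEY, "2048"); // explicit setting overrides the default
        System.out.println(amMemoryMb(conf)); // 2048
    }
}
```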
Code listing:
case NODE_UPDATE:
{
NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
RMNode node = nodeUpdatedEvent.getRMNode();
/**
 * Update node information:
 * 1. Process newly launched containers:
 *    trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
 *    job is to remove the container from containerAllocationExpirer monitoring, since the
 *    container is now running.
 * 2. Process completed containers:
 *    update the resource accounting in the queue, user, (FiCaSchedulerApp) application
 *    and (FiCaSchedulerNode) node.
 */
nodeUpdate(node);
/**
 * Whether to schedule asynchronously; defaults to false and is normally not set in
 * capacity-scheduler.xml.
 * Config key: yarn.scheduler.capacity.schedule-asynchronously.enable
 */
if (!scheduleAsynchronously) {
/**
 * Allocate resources to this node
 */
allocateContainersToNode(getNode(node.getNodeID()));
}
}
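When asynchronous scheduling is enabled, allocation is driven by a background thread instead of piggybacking on each node's heartbeat. A simplified sketch of the branch above (the interface and names here are illustrative; the real CapacityScheduler uses an internal AsyncScheduleThread):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: heartbeat-driven vs. asynchronous allocation. In synchronous mode
// (the default) the NODE_UPDATE handler allocates on the heartbeating node;
// in async mode a background thread performs allocation independently.
public class AsyncScheduleSketch {
    public interface Scheduler {
        void allocateContainersToNode(String nodeId);
    }

    public static void onNodeUpdate(Scheduler s, String nodeId, boolean scheduleAsynchronously) {
        // Mirrors the if (!scheduleAsynchronously) branch in the NODE_UPDATE case.
        if (!scheduleAsynchronously) {
            s.allocateContainersToNode(nodeId);
        }
    }

    public static void main(String[] args) {
        List<String> scheduled = new ArrayList<>();
        Scheduler s = scheduled::add;
        onNodeUpdate(s, "node1:8041", false); // sync: allocates on this heartbeat
        onNodeUpdate(s, "node2:8041", true);  // async: left to the background thread
        System.out.println(scheduled); // [node1:8041]
    }
}
```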
The NODE_UPDATE handler does two things:
1. Process the node's status update.
2. Allocate resources.
/**
 * 1. Process newly launched containers:
 *    trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
 *    job is to remove the container from containerAllocationExpirer monitoring, since the
 *    container is now running (the container was registered with containerAllocationExpirer
 *    while handling the APP_ATTEMPT_ADDED event).
 * 2. Process completed containers:
 *    update the resource accounting in the queue, user, (FiCaSchedulerApp) application and
 *    (FiCaSchedulerNode) node.
 * @param nm
 */
private synchronized void nodeUpdate(RMNode nm) {
if (LOG.isDebugEnabled()) {
LOG.debug("nodeUpdate: " + nm + " clusterResources: " + clusterResource);
}
FiCaSchedulerNode node = getNode(nm.getNodeID());
List<UpdatedContainerInfo> containerInfoList = nm.pullContainerUpdates();
List<ContainerStatus> newlyLaunchedContainers = new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers = new ArrayList<ContainerStatus>();
for(UpdatedContainerInfo containerInfo : containerInfoList) {
newlyLaunchedContainers.addAll(containerInfo.getNewlyLaunchedContainers());
completedContainers.addAll(containerInfo.getCompletedContainers());
}
// Processing the newly launched containers
for (ContainerStatus launchedContainer : newlyLaunchedContainers) {
/**
 * Trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main job
 * is to remove the container from containerAllocationExpirer monitoring, since the container
 * is now running (the container was registered with containerAllocationExpirer while
 * handling the APP_ATTEMPT_ADDED event).
 */
containerLaunchedOnNode(launchedContainer.getContainerId(), node);
}
// Process completed containers
for (ContainerStatus completedContainer : completedContainers) {
ContainerId containerId = completedContainer.getContainerId();
LOG.debug("Container FINISHED: " + containerId);
/**
 * Update the resource accounting in the queue, user, (FiCaSchedulerApp) application and
 * (FiCaSchedulerNode) node.
 */
completedContainer(getRMContainer(containerId),
completedContainer, RMContainerEventType.FINISHED);
}
// Now node data structures are upto date and ready for scheduling.
if(LOG.isDebugEnabled()) {
LOG.debug("Node being looked for scheduling " + nm
+ " availableResource: " + node.getAvailableResource());
}
}
Updating the node's information:
1. Process newly launched containers: trigger RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, which removes the container from containerAllocationExpirer monitoring since the container is now running.
2. Process completed containers: update the resource accounting in the queue, user, (FiCaSchedulerApp) application and (FiCaSchedulerNode) node.
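The bookkeeping for a completed container amounts to returning its resources to the node's available pool and subtracting them from the usage counters. A simplified stand-in for Hadoop's Resource/Resources arithmetic (only the queue and node counters are shown; user and application usage are updated the same way):

```java
// Sketch of what nodeUpdate -> completedContainer ultimately does to the
// resource counters. Class and field names are illustrative, not Hadoop's.
public class CompletedContainerBookkeeping {
    public static final class Resource {
        public int memoryMb, vcores;
        public Resource(int m, int v) { memoryMb = m; vcores = v; }
    }

    static void addTo(Resource lhs, Resource rhs) {
        lhs.memoryMb += rhs.memoryMb; lhs.vcores += rhs.vcores;
    }

    static void subtractFrom(Resource lhs, Resource rhs) {
        lhs.memoryMb -= rhs.memoryMb; lhs.vcores -= rhs.vcores;
    }

    // When a container finishes, the node gets its resources back and the
    // queue's (and user's, and application's) usage drops accordingly.
    public static void releaseContainer(Resource nodeAvailable, Resource queueUsed,
                                        Resource container) {
        addTo(nodeAvailable, container);
        subtractFrom(queueUsed, container);
    }

    public static void main(String[] args) {
        Resource nodeAvail = new Resource(4096, 4);
        Resource queueUsed = new Resource(3072, 3);
        releaseContainer(nodeAvail, queueUsed, new Resource(1024, 1));
        System.out.println(nodeAvail.memoryMb + " " + queueUsed.memoryMb); // 5120 2048
    }
}
```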
Before looking at the allocation code, consider two questions:
1. Allocation works queue by queue, so how is a queue chosen (in what order, by what criteria)?
2. Once a queue is chosen, how are its applications chosen for allocation (in what order are the applications submitted to that queue served)?
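The answers, as the comments in the code below spell out, are: child queues are visited in order of available capacity (most headroom first, i.e. least used capacity first), and applications inside a leaf queue are served FIFO. A simplified sketch of the queue-ordering rule (class and method names are illustrative; the real ordering lives in ParentQueue's queue comparator):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: ParentQueue visits child queues sorted by used capacity, least-used
// first, so the queue with the most headroom gets the first shot at the node.
public class SchedulingOrderSketch {
    public static final class Queue {
        public final String name;
        public final float usedCapacity; // fraction of configured capacity in use
        public Queue(String n, float u) { name = n; usedCapacity = u; }
    }

    public static List<String> visitOrder(List<Queue> children) {
        List<Queue> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingDouble(q -> q.usedCapacity));
        List<String> names = new ArrayList<>();
        for (Queue q : sorted) names.add(q.name);
        return names;
    }

    public static void main(String[] args) {
        List<Queue> children = Arrays.asList(
            new Queue("a", 0.9f), new Queue("b", 0.1f), new Queue("c", 0.5f));
        System.out.println(visitOrder(children)); // [b, c, a]
    }
}
```

Within the chosen leaf queue, applications are simply iterated in submission (FIFO) order, so the earliest-submitted application with an outstanding request is matched first.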
/**
 * To keep things simple and focus on the main flow, reservations are ignored for now.
 */
@VisibleForTesting
public synchronized void allocateContainersToNode(FiCaSchedulerNode node) {
if (rmContext.isWorkPreservingRecoveryEnabled()
&& !rmContext.isSchedulerReadyForAllocatingContainers()) {
return;
}
/**
 * The node has not registered (or has been removed)
 */
if (!nodes.containsKey(node.getNodeID())) {
LOG.info("Skipping scheduling as the node " + node.getNodeID() +
" has been removed");
return;
}
// Assign new containers...
// 1. Check for reserved applications
// 2. Schedule if there are no reservations
/**
 * Check whether the node has a reservation; if so, try to satisfy it first.
 *
 * To keep things simple, the reservedContainer case is not covered here.
 */
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
FiCaSchedulerApp reservedApplication =
getCurrentAttemptForContainer(reservedContainer.getContainerId());
// Try to fulfill the reservation
LOG.info("Trying to fulfill reservation for application " +
reservedApplication.getApplicationId() + " on node: " +
node.getNodeID());
LeafQueue queue = ((LeafQueue)reservedApplication.getQueue());
CSAssignment assignment =
queue.assignContainers(
clusterResource,
node,
new ResourceLimits(labelManager.getResourceByLabel(
RMNodeLabelsManager.NO_LABEL, clusterResource)));
RMContainer excessReservation = assignment.getExcessReservation();
if (excessReservation != null) {
Container container = excessReservation.getContainer();
queue.completedContainer(
clusterResource, assignment.getApplication(), node,
excessReservation,
SchedulerUtils.createAbnormalContainerStatus(
container.getId(),
SchedulerUtils.UNRESERVED_CONTAINER),
RMContainerEventType.RELEASED, null, true);
}
}
/**
 * minimumAllocation holds the minimum memory and minimum virtual cores; it is initialized
 * in CapacityScheduler.initScheduler:
 * minimum memory: yarn.scheduler.minimum-allocation-mb, default 1024 MB
 * minimum vcores: yarn.scheduler.minimum-allocation-vcores, default 1
 */
// Try to schedule more if there are no reservations to fulfill
if (node.getReservedContainer() == null) {
/**
 * Check whether the node's available resources are sufficient:
 * node.getAvailableResource() / minimumAllocation
 */
if (calculator.computeAvailableContainers(node.getAvailableResource(),
minimumAllocation) > 0) {
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to schedule on node: " + node.getNodeName() +
", available: " + node.getAvailableResource());
}
/**
 * Two questions to keep in mind here:
 * 1. Matching starts from root; which queue is tried first?
 *    Child queues are traversed in order of available capacity, most available first.
 * 2. In what order are demands matched within a queue?
 *    Applications within a queue are matched in FIFO order.
 *
 * Note: assignContainers starts matching from the root. assignContainers and
 * assignContainersToChildQueues call each other recursively, until a leaf queue's
 * assignContainers performs the actual allocation.
 */
root.assignContainers(
clusterResource,
node,
new ResourceLimits(labelManager.getResourceByLabel(
RMNodeLabelsManager.NO_LABEL, clusterResource)));
}
} else {
LOG.info("Skipping scheduling since node " + node.getNodeID() +
" is reserved by application " +
node.getReservedContainer().getContainerId().getApplicationAttemptId()
);
}
}
What allocateContainersToNode does:
Starting from the root queue, assignContainers is called all the way down to a leaf queue, where the actual allocation happens. During this descent, assignContainers and ParentQueue.assignContainersToChildQueues call each other recursively.
The main checks deciding whether allocation can proceed are:
1. The available resources reported by the node must be at least the configured minimumAllocation.
2. The queue's total usage after the allocation must not exceed the queue's resource limit.
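These two checks reduce to simple arithmetic. A sketch using the DefaultResourceCalculator's memory-only math (names are illustrative; DominantResourceCalculator would compare utilization ratios across memory and vcores instead):

```java
// Sketch of the two admission checks:
// 1. the node must fit at least one minimumAllocation,
// 2. the queue's usage after the allocation must stay under its limit.
public class AllocationChecks {
    // Mirrors DefaultResourceCalculator.computeAvailableContainers:
    // integer division of available memory by the minimum allocation.
    public static int computeAvailableContainers(int availableMb, int minimumAllocationMb) {
        return availableMb / minimumAllocationMb;
    }

    // Mirrors the canAssign-style check: the node can host at least one container.
    public static boolean canAssign(int nodeAvailableMb, int minimumAllocationMb) {
        return computeAvailableContainers(nodeAvailableMb, minimumAllocationMb) > 0;
    }

    // Mirrors the queue-limit check: usage after allocation must not exceed the limit.
    public static boolean withinQueueLimit(int queueUsedMb, int requestMb, int queueLimitMb) {
        return queueUsedMb + requestMb <= queueLimitMb;
    }

    public static void main(String[] args) {
        System.out.println(canAssign(512, 1024));   // false: below minimum-allocation-mb
        System.out.println(canAssign(2048, 1024));  // true: room for two minimum containers
        System.out.println(withinQueueLimit(7168, 2048, 8192)); // false: would exceed limit
    }
}
```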
Back to the main flow:
@Override
public synchronized CSAssignment ParentQueue.assignContainers(Resource clusterResource,
FiCaSchedulerNode node, ResourceLimits resourceLimits) {
CSAssignment assignment =
new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
Set<String> nodeLabels = node.getLabels();
/**
 * Check whether the node's labels match the queue's:
 * 1. A queue labeled * can access any node.
 * 2. A node with no label can be accessed by any queue.
 * 3. A queue with a specific label can only access nodes carrying that label.
 */
if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels, nodeLabels)) {
return assignment;
}
/**
 * Check whether the node's available resources meet the minimumAllocation requirement,
 * i.e. whether node.getAvailableResource() can hold at least one minimumAllocation:
 * 1. DefaultResourceCalculator computes this directly from memory and does not need
 *    clusterResource.
 * 2. DominantResourceCalculator works with resource utilization ratios and therefore
 *    needs clusterResource.
 */
while (canAssign(clusterResource, node)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Trying to assign containers to child-queue of "
+ getQueueName());
}
/**
 * Check whether the current queue has hit its resource limit, i.e. whether this
 * queue can still be allocated to
 */
if (!super.canAssignToThisQue