YARN源码分析—AM-RM通信协议，获得资源

最新推荐文章于 2024-03-28 10:57:21 发布

置顶 hahachenchen789

最新推荐文章于 2024-03-28 10:57:21 发布

阅读量3.3k

点赞数

分类专栏： hadoop Hadoop与Kubernetes学习

本文链接：https://blog.csdn.net/hahachenchen789/article/details/81065187

版权

Hadoop与Kubernetes学习同时被 2 个专栏收录

65 篇文章 6 订阅

订阅专栏

hadoop

41 篇文章 2 订阅

订阅专栏

在上几篇博文中分析了YARN调度模拟器SLS的源码，重点分析了AM与RM之间的通信协议。

接下来分析在YARN项目中，AM-RM通信如何实现的。

注意点：在YARN中，真正已经实现的只有RM和NM，而AM和client只是提供了api，需要用户自行实现。

而AM的主要功能是根据业务需求，从RM处申请资源，并利用这些资源完成业务逻辑，因此AM需要跟RM通信，也需要跟NM通信。

通信协议：

AM-RM：ApplicationMasterProtocol

AM-NM： ContainerManagementProtocol

这两个协议的定义在hadoop-yarn-api中

AM-RM协议：

AM与RM通过ApplicationMasterProtocol协议进行通信，该协议提供了几种方法：

1.向RM注册AM

public RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) 
  throws YarnException, IOException;

2.告诉RM，应用结束

 public FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) 
  throws YarnException, IOException;

3.向RM请求资源

  public AllocateResponse allocate(AllocateRequest request) 
  throws YarnException, IOException;

客户端（client）向RM提交应用后，RM会分配一定的资源启动AM，AM启动后调用ApplicationMasterProtocol的registerApplicationMaster方法向RM注册自己。完成注册后，调用allocate方法向RM申请运行任务的资源。获取资源后，通过与NM的通信协议：ContainerManagementProtocol启动资源容器，完成任务。完成后，通过ApplicationMasterProtocol的finishApplicationMaster方法向RM汇报应用结束，并注销AM。

接下来详细看下AM向RM请求资源的过程：

1.AM向RM注册

AM通过allocate方法向RM申请或释放资源。信息被封装为AllocateRequest里。

举个例子：hadoop实现了MR的例子。

org.apache.hadoop.mapreduce.v2.app.MRAppMaster.java

在serviceinit函数中，首先创建job也就是任务。

@Override
  protected void serviceInit(final Configuration conf) throws Exception {
    // create the job classloader if enabled
    createJobClassLoader(conf);

    initJobCredentialsAndUGI(conf);
......
      if(!stagingExists) {
        isLastAMRetry = true;
        LOG.info("Attempt num: " + appAttemptID.getAttemptId() +
            " is last retry: " + isLastAMRetry +
            " because the staging dir doesn't exist.");
        errorHappenedShutDown = true;
        forcedState = JobStateInternal.ERROR;
        shutDownMessage = "Staging dir does not exist " + stagingDir;
        LOG.error(shutDownMessage);
      } else if (commitStarted) {
        //A commit was started so this is the last time, we just need to know
        // what result we will use to notify, and how we will unregister
        errorHappenedShutDown = true;
        isLastAMRetry = true;
        LOG.info("Attempt num: " + appAttemptID.getAttemptId() +
            " is last retry: " + isLastAMRetry +
            " because a commit was started.");
        copyHistory = true;
        if (commitSuccess) {
          shutDownMessage =
              "Job commit succeeded in a prior MRAppMaster attempt " +
              "before it crashed. Recovering.";
          forcedState = JobStateInternal.SUCCEEDED;
        } else if (commitFailure) {
          shutDownMessage =
              "Job commit failed in a prior MRAppMaster attempt " +
              "before it crashed. Not retrying.";
          forcedState = JobStateInternal.FAILED;
        } else {
          if (isCommitJobRepeatable()) {
            // cleanup previous half done commits if committer supports
            // repeatable job commit.
            errorHappenedShutDown = false;
            cleanupInterruptedCommit(conf, fs, startCommitFile);
          } else {
            //The commit is still pending, commit error
            shutDownMessage =
                "Job commit from a prior MRAppMaster attempt is " +
                "potentially in progress. Preventing multiple commit executions";
            forcedState = JobStateInternal.ERROR;
          }
        }
      }
......
      // service to allocate containers from RM (if non-uber) or to fake it (uber)
      containerAllocator = createContainerAllocator(null, context);
      addIfService(containerAllocator);
      dispatcher.register(ContainerAllocator.EventType.class, containerAllocator);

然后最后从RM分配容器资源。通过调用createContainerAllocator函数实现。

该函数实现如下：

  protected ContainerAllocator createContainerAllocator(
      final ClientService clientService, final AppContext context) {
    return new ContainerAllocatorRouter(clientService, context);
  }

new了ContainerAllocatorRouter类并返回。

ContainerAllocatorRouter实现如下：

  private final class ContainerAllocatorRouter extends AbstractService
      implements ContainerAllocator, RMHeartbeatHandler {
    private final ClientService clientService;
    private final AppContext context;
    private ContainerAllocator containerAllocator;

    ContainerAllocatorRouter(ClientService clientService,
        AppContext context) {
      super(ContainerAllocatorRouter.class.getName());
      this.clientService = clientService;
      this.context = context;
    }

    @Override
    protected void serviceStart() throws Exception {
      if (job.isUber()) {
        MRApps.setupDistributedCacheLocal(getConfig());
        this.containerAllocator = new LocalContainerAllocator(
            this.clientService, this.context, nmHost, nmPort, nmHttpPort
            , containerID);
      } else {
        this.containerAllocator = new RMContainerAllocator(
            this.clientService, this.context, preemptionPolicy);
      }
      ((Service)this.containerAllocator).init(getConfig());
      ((Service)this.containerAllocator).start();
      super.serviceStart();
    }

    @Override
    protected void serviceStop() throws Exception {
      ServiceOperations.stop((Service) this.containerAllocator);
      super.serviceStop();
    }

    @Override
    public void handle(ContainerAllocatorEvent event) {
      this.containerAllocator.handle(event);
    }

    public void setSignalled(boolean isSignalled) {
      ((RMCommunicator) containerAllocator).setSignalled(isSignalled);
    }
    
    public void setShouldUnregister(boolean shouldUnregister) {
      ((RMCommunicator) containerAllocator).setShouldUnregister(shouldUnregister);
    }

    @Override
    public long getLastHeartbeatTime() {
      return ((RMCommunicator) containerAllocator).getLastHeartbeatTime();
    }

    @Override
    public void runOnNextHeartbeat(Runnable callback) {
      ((RMCommunicator) containerAllocator).runOnNextHeartbeat(callback);
    }
  }

ContainerAllocatorRouter返回后，serviceinit继续往下执行，将返回的该类服务加入到schedule中：

 addIfService(containerAllocator);

然后会启动该类的服务：serviceStart

  @Override
    protected void serviceStart() throws Exception {
      if (job.isUber()) {
        MRApps.setupDistributedCacheLocal(getConfig());
        this.containerAllocator = new LocalContainerAllocator(
            this.clientService, this.context, nmHost, nmPort, nmHttpPort
            , containerID);
      } else {
        this.containerAllocator = new RMContainerAllocator(
            this.clientService, this.context, preemptionPolicy);
      }
      ((Service)this.containerAllocator).init(getConfig());
      ((Service)this.containerAllocator).start();
      super.serviceStart();
    }

可以看到该类服务会判断job类型，是uber or not。uber是啥（hadoop针对小规模MR的本地模式，均在一个jvm中运行。可以理解为本地模式）。

如果不是uber，则new一个RMContainerAllocator，然后init和start这个类。

看下这个类：

org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.java

构造函数：用传入的参数进行成员变量初始化

  public RMContainerAllocator(ClientService clientService, AppContext context,
      AMPreemptionPolicy preemptionPolicy) {
    super(clientService, context);
    this.preemptionPolicy = preemptionPolicy;
    this.stopped = new AtomicBoolean(false);
    this.clock = context.getClock();
    this.assignedRequests = createAssignedRequests();
  }

(Service)this.containerAllocator).init(getConfig());中的init函数：

@Override
  protected void serviceInit(Configuration conf) throws Exception {
    super.serviceInit(conf);
    reduceSlowStart = conf.getFloat(
        MRJobConfig.COMPLETED_MAPS_FOR_REDUCE_SLOWSTART, 
        DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART);
    maxReduceRampupLimit = conf.getFloat(
        MRJobConfig.MR_AM_JOB_REDUCE_RAMPUP_UP_LIMIT, 
        MRJobConfig.DEFAULT_MR_AM_JOB_REDUCE_RAMP_UP_LIMIT);
    maxReducePreemptionLimit = conf.getFloat(
        MRJobConfig.MR_AM_JOB_REDUCE_PREEMPTION_LIMIT,
        MRJobConfig.DEFAULT_MR_AM_JOB_REDUCE_PREEMPTION_LIMIT);
    reducerUnconditionalPreemptionDelayMs = 1000 * conf.getInt(
        MRJobConfig.MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC,
        MRJobConfig.DEFAULT_MR_JOB_REDUCER_UNCONDITIONAL_PREEMPT_DELAY_SEC);
    reducerNoHeadroomPreemptionDelayMs = conf.getInt(
        MRJobConfig.MR_JOB_REDUCER_PREEMPT_DELAY_SEC,
        MRJobConfig.DEFAULT_MR_JOB_REDUCER_PREEMPT_DELAY_SEC) * 1000;//sec -> ms
    maxRunningMaps = conf.getInt(MRJobConfig.JOB_RUNNING_MAP_LIMIT,
        MRJobConfig.DEFAULT_JOB_RUNNING_MAP_LIMIT);
    maxRunningReduces = conf.getInt(MRJobConfig.JOB_RUNNING_REDUCE_LIMIT,
        MRJobConfig.DEFAULT_JOB_RUNNING_REDUCE_LIMIT);
    RackResolver.init(conf);
    retryInterval = getConfig().getLong(MRJobConfig.MR_AM_TO_RM_WAIT_INTERVAL_MS,
                                MRJobConfig.DEFAULT_MR_AM_TO_RM_WAIT_INTERVAL_MS);
    mapNodeLabelExpression = conf.get(MRJobConfig.MAP_NODE_LABEL_EXP);
    reduceNodeLabelExpression = conf.get(MRJobConfig.REDUCE_NODE_LABEL_EXP);
    // Init startTime to current time. If all goes well, it will be reset after
    // first attempt to contact RM.
    retrystartTime = System.currentTimeMillis();
    this.scheduledRequests.setNumOpportunisticMapsPercent(
        conf.getInt(MRJobConfig.MR_NUM_OPPORTUNISTIC_MAPS_PERCENT,
            MRJobConfig.DEFAULT_MR_NUM_OPPORTUNISTIC_MAPS_PERCENT));
    LOG.info(this.scheduledRequests.getNumOpportunisticMapsPercent() +
        "% of the mappers will be scheduled using OPPORTUNISTIC containers");
  }

对所需容器的一些参数进行配置，比如心跳时间，map或reduce等。

((Service)this.containerAllocator).start()：start函数

  @Override
  protected void serviceStart() throws Exception {
    this.eventHandlingThread = new Thread() {
      @SuppressWarnings("unchecked")
      @Override
      public void run() {

        ContainerAllocatorEvent event;

        while (!stopped.get() && !Thread.currentThread().isInterrupted()) {
          try {
            event = RMContainerAllocator.this.eventQueue.take();
          } catch (InterruptedException e) {
            if (!stopped.get()) {
              LOG.error("Returning, interrupted : " + e);
            }
            return;
          }

          try {
            handleEvent(event);
          } catch (Throwable t) {
            LOG.error("Error in handling event type " + event.getType()
                + " to the ContainreAllocator", t);
            // Kill the AM
            eventHandler.handle(new JobEvent(getJob().getID(),
              JobEventType.INTERNAL_ERROR));
            return;
          }
        }
      }
    };
    this.eventHandlingThread.start();
    super.serviceStart();
  }

服务启动后，会进入心跳模式。

循环执行类中的heartbeat方法：

 @Override
  protected synchronized void heartbeat() throws Exception {
    scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
    List<Container> allocatedContainers = getResources();
    if (allocatedContainers != null && allocatedContainers.size() > 0) {
      scheduledRequests.assign(allocatedContainers);
    }

    int completedMaps = getJob().getCompletedMaps();
    int completedTasks = completedMaps + getJob().getCompletedReduces();
    if ((lastCompletedTasks != completedTasks) ||
          (scheduledRequests.maps.size() > 0)) {
      lastCompletedTasks = completedTasks;
      recalculateReduceSchedule = true;
    }

    if (recalculateReduceSchedule) {
      boolean reducerPreempted = preemptReducesIfNeeded();

      if (!reducerPreempted) {
        // Only schedule new reducers if no reducer preemption happens for
        // this heartbeat
        scheduleReduces(getJob().getTotalMaps(), completedMaps,
            scheduledRequests.maps.size(), scheduledRequests.reduces.size(),
            assignedRequests.maps.size(), assignedRequests.reduces.size(),
            mapResourceRequest, reduceResourceRequest, pendingReduces.size(),
            maxReduceRampupLimit, reduceSlowStart);
      }

      recalculateReduceSchedule = false;
    }

    scheduleStats.updateAndLogIfChanged("After Scheduling: ");
  }

在该方法中，最重要的是第一行，获得资源：

List<Container> allocatedContainers = getResources();

获得资源后，加入到schedule的assign分配中：

   if (allocatedContainers != null && allocatedContainers.size() > 0) {
      scheduledRequests.assign(allocatedContainers);
    }

接下来逐段具体分析getResources函数：

 @SuppressWarnings("unchecked")
  private List<Container> getResources() throws Exception {
    applyConcurrentTaskLimits();

    // will be null the first time
    Resource headRoom = Resources.clone(getAvailableResources());
    AllocateResponse response;
    /*
     * If contact with RM is lost, the AM will wait MR_AM_TO_RM_WAIT_INTERVAL_MS
     * milliseconds before aborting. During this interval, AM will still try
     * to contact the RM.
     */
    try {
      response = makeRemoteRequest();
      // Reset retry count if no exception occurred.
      retrystartTime = System.currentTimeMillis();

这里最重要的就是response = makeRemoteRequest()，这是AM向RM通信索取资源的关键方法。

接着就是try后面进行catch捕捉异常

  } catch (ApplicationAttemptNotFoundException e ) {
      // This can happen if the RM has been restarted. If it is in that state,
      // this application must clean itself up.
      eventHandler.handle(new JobEvent(this.getJob().getID(),
        JobEventType.JOB_AM_REBOOT));
      throw new RMContainerAllocationException(
        "Resource Manager doesn't recognize AttemptId: "
            + this.getContext().getApplicationAttemptId(), e);
    } catch (ApplicationMasterNotRegisteredException e) {
      LOG.info("ApplicationMaster is out of sync with ResourceManager,"
          + " hence resync and send outstanding requests.");
      // RM may have restarted, re-register with RM.
      lastResponseID = 0;
      register();
      addOutstandingRequestOnResync();
      return null;
    } catch (InvalidLabelResourceRequestException e) {
      // If Invalid label exception is received means the requested label doesnt
      // have access so killing job in this case.
      String diagMsg = "Requested node-label-expression is invalid: "
          + StringUtils.stringifyException(e);
      LOG.info(diagMsg);
      JobId jobId = this.getJob().getID();
      eventHandler.handle(new JobDiagnosticsUpdateEvent(jobId, diagMsg));
      eventHandler.handle(new JobEvent(jobId, JobEventType.JOB_KILL));
      throw e;
    } catch (Exception e) {
      // This can happen when the connection to the RM has gone down. Keep
      // re-trying until the retryInterval has expired.
      if (System.currentTimeMillis() - retrystartTime >= retryInterval) {
        LOG.error("Could not contact RM after " + retryInterval + " milliseconds.");
        eventHandler.handle(new JobEvent(this.getJob().getID(),
                                         JobEventType.JOB_AM_REBOOT));
        throw new RMContainerAllocationException("Could not contact RM after " +
                                retryInterval + " milliseconds.");
      }
      // Throw this up to the caller, which may decide to ignore it and
      // continue to attempt to contact the RM.
      throw e;
    }

也就是捕捉response = makeRemoteRequest返回的异常，比如资源不够，资源分配出错等。

接着：

 Resource newHeadRoom = getAvailableResources();
 List<Container> newContainers = response.getAllocatedContainers();

这是将response（回复）中的已分配好的container资源拎出来赋给List<Container> newContainers。这就是AM所需要的资源。

接着：

   // Setting NMTokens
    if (response.getNMTokens() != null) {
      for (NMToken nmToken : response.getNMTokens()) {
        NMTokenCache.setNMToken(nmToken.getNodeId().toString(),
            nmToken.getToken());
      }
    }

    // Setting AMRMToken
    if (response.getAMRMToken() != null) {
      updateAMRMToken(response.getAMRMToken());
    }

    List<ContainerStatus> finishedContainers =
        response.getCompletedContainersStatuses();

    // propagate preemption requests
    final PreemptionMessage preemptReq = response.getPreemptionMessage();
    if (preemptReq != null) {
      preemptionPolicy.preempt(
          new PreemptionContext(assignedRequests), preemptReq);
    }

    if (newContainers.size() + finishedContainers.size() > 0
        || !headRoom.equals(newHeadRoom)) {
      //something changed
      recalculateReduceSchedule = true;
      if (LOG.isDebugEnabled() && !headRoom.equals(newHeadRoom)) {
        LOG.debug("headroom=" + newHeadRoom);
      }
    }

    if (LOG.isDebugEnabled()) {
      for (Container cont : newContainers) {
        LOG.debug("Received new Container :" + cont);
      }
    }

    //Called on each allocation. Will know about newly blacklisted/added hosts.
    computeIgnoreBlacklisting();

    handleUpdatedNodes(response);
    handleJobPriorityChange(response);
    // Handle receiving the timeline collector address and token for this app.
    MRAppMaster.RunningAppContext appContext =
        (MRAppMaster.RunningAppContext)this.getContext();
    if (appContext.getTimelineV2Client() != null) {
      appContext.getTimelineV2Client().
          setTimelineCollectorInfo(response.getCollectorInfo());
    }
    for (ContainerStatus cont : finishedContainers) {
      processFinishedContainer(cont);
    }
    return newContainers;
  }

进行一些处理后，返回newContainers。

接下来对getResources方法中调用到的makeRemoteRequest方法进行分析：

org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.java

makeRemoteRequest方法：

  protected AllocateRespo makeRemoteRequestnse () throws YarnException,
      IOException {
    applyRequestLimits();
    ResourceBlacklistRequest blacklistRequest =
        ResourceBlacklistRequest.newInstance(new ArrayList<String>(blacklistAdditions),
            new ArrayList<String>(blacklistRemovals));
    AllocateRequest allocateRequest =
        AllocateRequest.newInstance(lastResponseID,
          super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
          new ArrayList<ContainerId>(release), blacklistRequest);
    AllocateResponse allocateResponse = scheduler.allocate(allocateRequest);
    lastResponseID = allocateResponse.getResponseId();
    availableResources = allocateResponse.getAvailableResources();
    lastClusterNmCount = clusterNmCount;
    clusterNmCount = allocateResponse.getNumClusterNodes();
    int numCompletedContainers =
        allocateResponse.getCompletedContainersStatuses().size();

    if (ask.size() > 0 || release.size() > 0) {
      LOG.info("getResources() for " + applicationId + ":" + " ask="
          + ask.size() + " release= " + release.size() + " newContainers="
          + allocateResponse.getAllocatedContainers().size()
          + " finishedContainers=" + numCompletedContainers
          + " resourcelimit=" + availableResources + " knownNMs="
          + clusterNmCount);
    }

    ask.clear();
    release.clear();

    if (numCompletedContainers > 0) {
      // re-send limited requests when a container completes to trigger asking
      // for more containers
      requestLimitsToUpdate.addAll(requestLimits.keySet());
    }

    if (blacklistAdditions.size() > 0 || blacklistRemovals.size() > 0) {
      LOG.info("Update the blacklist for " + applicationId +
          ": blacklistAdditions=" + blacklistAdditions.size() +
          " blacklistRemovals=" +  blacklistRemovals.size());
    }
    blacklistAdditions.clear();
    blacklistRemovals.clear();
    return allocateResponse;
  }

定义一个AllocateRequest对象，调用newInstance实例化，并加入了一个getApplicationProgress方法。

    AllocateRequest allocateRequest =
        AllocateRequest.newInstance(lastResponseID,
          super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
          new ArrayList<ContainerId>(release), blacklistRequest);

这样就构造完成一个标准的allocateRequest对象。可以发送给RM了。

接着，调用scheduler.allocate将请求加入调度中，返回值为标准的RM返回格式AllocateResponse。

这里的scheduler.allocate(allocateRequest)是不是似曾相识的感觉？

这里就是最上面提到的ApplicationMasterProtocol协议的第三个方法：

public AllocateResponse allocate(AllocateRequest request)

AllocateResponse allocateResponse = scheduler.allocate(allocateRequest);

这就完成了AM向RM请求资源和回复资源。

然后将allocateResponse返回即可。

最后来详细分析下AllocateRequest类和AllocateResponse类。

AllocateRequest：AM向RM请求资源的标准包

org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest.java

AllocateRequest包格式如下：

1. responseID，相应的ID，用于区分重复相应

2. appProgress，进程的进度

3. askList（List<ResourceRequest> resourceAsk），AM向RM请求的资源列表，是一个List<ResourceRequest> 对象。其中ResourceRequest是一个资源请求的详细参数，包括容器个数，容器容量，分配策略等。

hadoop-yarn-api/src/main/proto/yarn_protos.proto

ResourceRequest

message ResourceRequestProto {
  optional PriorityProto priority = 1;
  optional string resource_name = 2;
  optional ResourceProto capability = 3;
  optional int32 num_containers = 4;
  optional bool relax_locality = 5 [default = true];
  optional string node_label_expression = 6;
  optional ExecutionTypeRequestProto execution_type_request = 7;
  optional int64 allocation_request_id = 8 [default = -1];
}

4. resourceBlacklistRequest，要添加或者删除的资源黑名单

AllocateResponse类：RM向AM回复的资源包

org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse.java

包内容包括：

1.responseId，回复的ID，避免重复响应

2.numClusterNodes，集群规模大小

3.completedContainersStatuses，已完成的容器状态列表

4. allocatedContainers，RM新分配的资源给AM，这些资源封装在Container类中，因此返回类型通常为List<Container>

Container组成：

org.apache.hadoop.yarn.api.records.Container.java

  public static Container newInstance(ContainerId containerId, NodeId nodeId,
      String nodeHttpAddress, Resource resource, Priority priority,
      Token containerToken, ExecutionType executionType) {
    Container container = Records.newRecord(Container.class);
    container.setId(containerId);
    container.setNodeId(nodeId);
    container.setNodeHttpAddress(nodeHttpAddress);
    container.setResource(resource);
    container.setPriority(priority);
    container.setContainerToken(containerToken);
    container.setExecutionType(executionType);
    return container;
  }

可以看到container组成包括：

a. container ID

b. Node ID

c. nodeHttpAddress：节点http的地址

d. resource：为Resource类，格式<mem, vcores>

e. priority：优先级

f. containerToken: 容器令牌

g. executionType: 容器运行的类型

5.updatedNodes，状态被更新过的所有节点列表，每个节点的更新信息存放在NodeReport类中，因此返回类型通常为List<NodeReport>

6.amCommand，RM给AM发送的控制命令，包括重连和关闭。

7.preemptionMessage，资源抢占信息，包括两部分：强制回收部分和可自主调配部分

8.nmTokens，AM与NM之间的通信令牌