MRAppMaster on YARN in Detail: Source-Code Analysis of Task Resource Requests and Container Launch

午后的红茶meton

Article link: https://blog.csdn.net/u012151684/article/details/108240025

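MRAppMaster is the ApplicationMaster implementation for MapReduce; it is what allows the MapReduce framework to run on YARN. In YARN, MRAppMaster manages the lifecycle of a MapReduce job: it creates the job, requests resources from the ResourceManager, asks NodeManagers to start Containers, monitors the job's running state, and restarts tasks when they fail.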

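YARN uses an event-driven, asynchronous concurrency model: components are connected through events, and a central event dispatcher routes each event to its handler. Every component in YARN is an event handler. MRAppMaster runs as a separate process whose entry point is org.apache.hadoop.mapreduce.v2.app.MRAppMaster#main, and it is started in a Container that YARN allocates after the client submits the application. Its main responsibilities in managing the job lifecycle are: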

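Job management: creating, initializing, and starting the job;
Requesting resources from the RM and reallocating them to tasks;
Starting and releasing Containers;
Monitoring the job's running state;
Job recovery.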

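When MRAppMaster starts, its components register themselves as services with MRAppMaster's central event dispatcher and declare which event types they handle. When an event occurs, MRAppMaster looks it up in the <event, handler> table and passes it to the matching handler. The registered events and their handlers are as follows:

[Table: registered event types and their event handlers]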
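To make the <event, handler> lookup concrete, here is a minimal sketch of a central dispatcher. It is not the Hadoop dispatcher: the class DispatcherSketch, the Handler interface, and the two tiny enums are hypothetical names chosen for illustration, and the real one also queues events and processes them on its own thread, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a central event dispatcher.
// None of these names are the real Hadoop classes.
class DispatcherSketch {
    enum JobEventType { JOB_INIT, JOB_START }
    enum TaskEventType { T_SCHEDULE }

    interface Handler {
        void handle(Enum<?> event);
    }

    // The <event type, handler> table: one handler per event enum class.
    private final Map<Class<?>, Handler> table = new HashMap<>();

    void register(Class<? extends Enum<?>> eventType, Handler handler) {
        table.put(eventType, handler);
    }

    // Looks up the handler by the enum class the event belongs to.
    void dispatch(Enum<?> event) {
        Handler h = table.get(event.getDeclaringClass());
        if (h == null) {
            throw new IllegalStateException("No handler registered for " + event);
        }
        h.handle(event);
    }

    // Registers two handlers, sends three events, and returns what the handlers saw.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        DispatcherSketch d = new DispatcherSketch();
        d.register(JobEventType.class, e -> log.add("job:" + e.name()));
        d.register(TaskEventType.class, e -> log.add("task:" + e.name()));
        d.dispatch(JobEventType.JOB_INIT);
        d.dispatch(TaskEventType.T_SCHEDULE);
        d.dispatch(JobEventType.JOB_START);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keying the table by the enum class mirrors the dispatcher.register(SomeEventType.class, handler) calls that appear in the MRAppMaster code in this article: a handler is chosen by the kind of event, not by the individual event value.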

The architecture of MRAppMaster and the function of each of its components/services are as follows:

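ContainerAllocator: communicates with the ResourceManager to request resources for the job. Each task's resource need is described by the 4-tuple <Priority, hostname, capability, containers>: the priority, the host where the resource is wanted, the amount of resources (memory only at the time this description was written), and the number of containers. The ContainerAllocator periodically calls the ResourceManager over RPC, and the ResourceManager replies with the list of newly allocated containers, the list of completed containers, and related information.
ContainerLauncher: communicates with NodeManagers to start Containers. After the ResourceManager allocates resources to the job, the ContainerLauncher packages the information into a container launch description, including the resources the task needs, its launch command, its environment, and the external files it depends on, and then asks the corresponding node to start the container.
Job: represents a MapReduce job. Like JobInProgress in MRv1, it monitors the job's running state. It maintains a job state machine so that job operations can be controlled asynchronously.
Task: represents one task of a MapReduce job. Like TaskInProgress in MRv1, it monitors the running state of a single task. It maintains a task state machine so that task operations can be controlled asynchronously.
TaskAttempt: represents one running instance of a task, the same concept as in MRv1.
Speculator: implements speculative execution. When a task runs noticeably slower than the others, the Speculator starts a backup task that processes the same data alongside the slow one. Whichever finishes first provides the final result, and the other is killed. This keeps straggler tasks from slowing down the whole job.
TaskAttemptListener: manages task heartbeats. If a task does not report a heartbeat for a certain period, it is considered dead and is removed from the system. Like the TaskTracker in MRv1, it implements TaskUmbilicalProtocol, which tasks use to report heartbeats and to ask whether they may commit their final output.
JobHistoryEventHandler: logs job events such as job creation, job start, and task start. The logs are written to a directory on HDFS, which is very useful for job recovery. When MRAppMaster fails, YARN reschedules it on another node. To avoid recomputation, the new MRAppMaster first reads the logs of the previous run from HDFS to recover the tasks that had already completed, so it only runs the tasks that had not finished.
ClientService: an interface implemented by MRClientService. MRClientService implements MRClientProtocol, through which clients can query the job's execution state (without going through the ResourceManager) and control the job (for example, kill it).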
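The <Priority, hostname, capability, containers> tuple from the ContainerAllocator entry above can be pictured as a small table keyed by the first three fields, with the container count as the value. The sketch below is only an illustration under that assumption: AskTableSketch and its methods are hypothetical names, capability is reduced to a memory figure, and the priority value 20 is an arbitrary example rather than a documented constant.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the <priority, hostname, capability, containers> tuple.
// It is not the real Hadoop request table, only an illustration of the idea.
class AskTableSketch {
    // Key: priority, host and memory; value: how many containers are wanted.
    private final Map<String, Integer> asks = new LinkedHashMap<>();

    private static String key(int priority, String host, int memoryMb) {
        return priority + "|" + host + "|" + memoryMb;
    }

    // One more container wanted at this priority, on this host, with this much memory.
    void addRequest(int priority, String host, int memoryMb) {
        asks.merge(key(priority, host, memoryMb), 1, Integer::sum);
    }

    int containersFor(int priority, String host, int memoryMb) {
        return asks.getOrDefault(key(priority, host, memoryMb), 0);
    }

    int distinctRequests() {
        return asks.size();
    }

    // Two map tasks want node1 and one wants node2; the priority 20 is illustrative.
    static AskTableSketch demo() {
        AskTableSketch t = new AskTableSketch();
        t.addRequest(20, "node1", 1024);
        t.addRequest(20, "node1", 1024);
        t.addRequest(20, "node2", 1024);
        return t;
    }

    public static void main(String[] args) {
        AskTableSketch t = demo();
        System.out.println("node1 containers: " + t.containersFor(20, "node1", 1024));
        System.out.println("distinct requests: " + t.distinctRequests());
    }
}
```

Aggregating identical requests into one entry with a count keeps the periodic message to the ResourceManager small even when many tasks want the same host.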

The MRAppMaster workflow is as follows:

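1. The user submits an application to YARN (the RM), including the ApplicationMaster program, the command to start the ApplicationMaster, the user program, and so on.
2. The ResourceManager allocates the first Container for the application and contacts a NodeManager to start the application's ApplicationMaster. On receiving the command, the NodeManager first downloads the required files from HDFS (caching them locally) and then starts the ApplicationMaster.
3. Once the ApplicationMaster is running, it communicates with the ResourceManager to request and obtain resources. After obtaining resources, it contacts the corresponding NodeManager to start tasks. (If this is the first time the application starts a task on that node, the NodeManager first downloads the files from HDFS into its local cache and then starts the task.)
4. The ApplicationMaster first registers with the ResourceManager, so that users can check the application's running state directly through the ResourceManager. It then requests resources for each task and monitors the tasks until they finish, that is, it repeats steps 5 to 8.
5. The ApplicationMaster polls the ResourceManager over RPC to request and receive resources.
6. Once the ApplicationMaster has obtained resources, it hands the launch command to the NodeManager and asks it to start the task. The launch command carries the information the Container needs to communicate with the ApplicationMaster.
7. The NodeManager sets up the task's runtime environment (environment variables, JARs, binaries, and so on), writes the launch command into a script, and starts the task (Container) by running that script.
8. While the application is running, the user can query the ApplicationMaster over RPC at any time for the application's current state.
9. When the application finishes, the ApplicationMaster unregisters from the ResourceManager and shuts itself down.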
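The register, poll, launch, unregister sequence in the steps above can be sketched against a fake ResourceManager. Everything here is hypothetical: FakeRm is not a YARN class, it grants at most one container per poll, and "launching" is reduced to recording the container name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ApplicationMaster side of the workflow.
class AmLoopSketch {
    // Stand-in for the ResourceManager; it only records which calls were made.
    static class FakeRm {
        final List<String> calls = new ArrayList<>();
        private int granted = 0;

        void register() {
            calls.add("register");
        }

        // Grants at most one container per poll, like a busy cluster might.
        List<String> allocate(int stillNeeded) {
            calls.add("allocate");
            List<String> out = new ArrayList<>();
            if (stillNeeded > 0) {
                out.add("container_" + (++granted));
            }
            return out;
        }

        void unregister() {
            calls.add("unregister");
        }
    }

    // Registers, polls until every task has a container, then unregisters.
    static List<String> runJob(FakeRm rm, int tasks) {
        List<String> launched = new ArrayList<>();
        rm.register();
        while (launched.size() < tasks) {
            for (String container : rm.allocate(tasks - launched.size())) {
                // In the real flow this is where a NodeManager is asked to start the task.
                launched.add(container);
            }
        }
        rm.unregister();
        return launched;
    }

    public static void main(String[] args) {
        FakeRm rm = new FakeRm();
        System.out.println(runJob(rm, 3));
        System.out.println(rm.calls);
    }
}
```

The loop makes the polling nature of the protocol visible: the ApplicationMaster never receives a push from the ResourceManager, it only learns about new containers in the reply to its own request.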

Next we walk through the source code that MRAppMaster's ContainerAllocator and ContainerLauncher use to request resources and start containers:

ContainerAllocator

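When an MRAppMaster instance is constructed and initialized, it calls the parent-class constructor and then serviceInit(), which starts the composed services. The points worth noting are that it creates and registers the events and event handlers mentioned above, along with the ContainerAllocator and ContainerLauncher discussed here. The relevant construction and serviceInit() code is as follows: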

containerAllocator = createContainerAllocator(clientService, context);
dispatcher.register(ContainerAllocator.EventType.class, containerAllocator);

// corresponding service to launch allocated containers via NodeManager
containerLauncher = createContainerLauncher(context);
dispatcher.register(ContainerLauncher.EventType.class, containerLauncher);

// register the event dispatchers
dispatcher.register(JobEventType.class, jobEventDispatcher);
dispatcher.register(TaskEventType.class, new TaskEventDispatcher());
dispatcher.register(TaskAttemptEventType.class, new TaskAttemptEventDispatcher());
dispatcher.register(CommitterEventType.class, committerEventHandler);


private final class ContainerAllocatorRouter extends AbstractService
    implements ContainerAllocator, RMHeartbeatHandler {
  private final ClientService clientService;
  private final AppContext context;
  private ContainerAllocator containerAllocator; // the component that actually talks to the RM to request resources

  ContainerAllocatorRouter(ClientService clientService,
      AppContext context) {
    super(ContainerAllocatorRouter.class.getName());
    this.clientService = clientService;
    this.context = context;
  }

  @Override
  protected void serviceStart() throws Exception {
    if (job.isUber()) { // uber mode for small jobs: map and reduce tasks all run in the same container
      MRApps.setupDistributedCacheLocal(getConfig());
      this.containerAllocator = new LocalContainerAllocator(
          this.clientService, this.context, nmHost, nmPort, nmHttpPort
          , containerID);
    } else { 
      this.containerAllocator = new RMContainerAllocator(
          this.clientService, this.context);
    }
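    // Call the subclass and parent-class methods to initialize and start the service.
    // The core step is RMCommunicator.serviceStart() in the parent class, which builds
    // the client proxy for ApplicationMasterProtocol (the RPC protocol to the ResourceManager)
    // and starts the allocation thread that periodically runs heartbeat() to request resources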
    ((Service)this.containerAllocator).init(getConfig());
    ((Service)this.containerAllocator).start();
    super.serviceStart();
  }
}


// RMCommunicator#serviceStart()
@Override
protected void serviceStart() throws Exception {
  scheduler = createSchedulerProxy(); // build the RPC client proxy for talking to the RM
  JobID id = TypeConverter.fromYarn(this.applicationId);
  JobId jobId = TypeConverter.toYarn(id);
  job = context.getJob(jobId);
  register(); // register with the ApplicationMasterService in the RM
  startAllocatorThread(); // start the periodic thread that requests resources
  super.serviceStart(); 
}

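As shown above, when the ContainerAllocator is constructed and started, it builds the client proxy for ApplicationMasterProtocol, the RPC protocol used to talk to the ResourceManager, and starts the allocation thread, which periodically runs heartbeat() to communicate with the ResourceManager and request resources. serviceStart() also shows that MRAppMaster calls register() after startup to register with the RM; registration simply uses the RPC client proxy to invoke registerApplicationMaster() on the RM. On the RM side, ApplicationMasterService handles the request and adds the application to the AMLivelinessMonitor, so that periodic heartbeats can be used to detect whether this MRAppMaster is still alive (registration was briefly analyzed in an earlier post and is not repeated here). Next we look at the heartbeat() method that the allocatorThread runs periodically: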

@Override
protected synchronized void heartbeat() throws Exception {
  scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
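  List<Container> allocatedContainers = getResources(); // fetch the containers granted by the RM
  
  // Next, hand the granted containers out to the job's own tasks (a second-level allocation),
  // based on each container's resources and the priority of the task it was requested for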
  if (allocatedContainers != null && allocatedContainers.size() > 0) {
    scheduledRequests.assign(allocatedContainers);
  }
  // ......
}
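The periodic allocation thread that drives heartbeat() can be sketched as a plain loop of sleep, heartbeat, repeat, with a stop flag. This is an assumption-laden illustration rather than the RMCommunicator code: HeartbeatThreadSketch is a hypothetical class, its heartbeat() only increments a counter, and the 5 ms interval is chosen so the demo finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a periodic heartbeat thread; not the Hadoop implementation.
class HeartbeatThreadSketch {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final AtomicInteger beats = new AtomicInteger();
    private final long intervalMs;
    private Thread allocatorThread;

    HeartbeatThreadSketch(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // Placeholder for the real work (talking to the RM); here it only counts.
    protected void heartbeat() {
        beats.incrementAndGet();
    }

    void start() {
        allocatorThread = new Thread(() -> {
            while (!stopped.get()) {
                try {
                    Thread.sleep(intervalMs);
                    heartbeat();
                } catch (InterruptedException e) {
                    return; // stop() interrupted the sleep
                }
            }
        }, "allocator-sketch");
        allocatorThread.start();
    }

    void stop() {
        stopped.set(true);
        allocatorThread.interrupt();
        try {
            allocatorThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int beats() {
        return beats.get();
    }

    // Waits for at least three heartbeats, stops the thread, and returns the count
    // (or -1 if three heartbeats did not happen within five seconds).
    static int demo() {
        final CountDownLatch three = new CountDownLatch(3);
        HeartbeatThreadSketch s = new HeartbeatThreadSketch(5) {
            @Override
            protected void heartbeat() {
                super.heartbeat();
                three.countDown();
            }
        };
        s.start();
        boolean reached = false;
        try {
            reached = three.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        s.stop();
        return reached ? s.beats() : -1;
    }

    public static void main(String[] args) {
        System.out.println("heartbeats seen: " + demo());
    }
}
```

Making heartbeat() an overridable method follows the shape of the excerpt above, where the allocator subclass supplies heartbeat() and the parent class owns the thread.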

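getResources() is the main resource-request method. Internally it uses the RPC client proxy to call AllocateResponse allocateResponse = ApplicationMasterProtocol.allocate(allocateRequest); to request resources from the RM. While handling the response it updates node state, deals with failed map and reduce tasks, processes the list of completed containers, and returns the newly allocated containers contained in the response. heartbeat() then calls scheduledRequests.assign(allocatedContainers); to hand those containers out to the job's own tasks according to its policy, as follows: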

// this method will change the list of allocatedContainers.
private void assign(List<Container> allocatedContainers) {
  Iterator<Container> it = allocatedContainers.iterator();
  LOG.info("Got allocated containers " + allocatedContainers.size());
  containersAllocated += allocatedContainers.size();
  while (it.hasNext()) {
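    // The while loop checks whether each container satisfies the resource requirement
    // and is not on a blacklisted node. If the container's node is blacklisted, a task
    // matching that container is found and a new request is issued for it.
    // Containers that fail these checks are removed from the list and added to the release
    // list, which is sent to the RM on the next heartbeat so that the RM can free them.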
    Container allocated = it.next();
    // check if allocated container meets memory requirements 
    // and whether we have any scheduled tasks that need 
    // a container to be assigned
    boolean isAssignable = true;
    Priority priority = allocated.getPriority();
    Resource allocatedResource = allocated.getResource();
    if (PRIORITY_FAST_FAIL_MAP.equals(priority) 
        || PRIORITY_MAP.equals(priority)) {
      if (ResourceCalculatorUtils.computeAvailableContainers(allocatedResource,
          mapResourceRequest, getSchedulerResourceTypes()) <= 0
          || maps.isEmpty()) {
        isAssignable = false; 
      }
    } 
    else if (PRIORITY_REDUCE.equals(priority)) {
      if (ResourceCalculatorUtils.computeAvailableContainers(allocatedResource,
          reduceResourceRequest, getSchedulerResourceTypes()) <= 0
          || reduces.isEmpty()) {
        isAssignable = false;
      }
    } else {
      LOG.warn("Container allocated at unwanted priority: " + priority + 
          ". Returning to RM...");
      isAssignable = false;
    }
    
    if(!isAssignable) {
      // release container if we could not assign it 
      containerNotAssigned(allocated);
      it.remove();
      continue;
    }
    
    // do not assign if allocated container is on a  
    // blacklisted host
    String allocatedHost = allocated.getNodeId().getHost();
    if (isNodeBlacklisted(allocatedHost)) {
      // we need to request for a new container 
      // and release the current one
      // find the request matching this allocated container 
      // and replace it with a new one 
      // ......
      // release container if we could not assign it 
      containerNotAssigned(allocated);
      it.remove();
      continue;
    }
  }

  // hand the usable containers out to actual tasks (second-level allocation)
  assignContainers(allocatedContainers);
   
  // release container if we could not assign it 
  // Finally, containers that were not assigned are added to the release list and
  // returned to the RM on the next heartbeat so that the RM can free them
  it = allocatedContainers.iterator();
  while (it.hasNext()) {
    Container allocated = it.next();
    LOG.info("Releasing unassigned and invalid container " 
        + allocated + ". RM may have assignment issues");
    containerNotAssigned(allocated);
  }
}
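The filtering pass at the top of assign() can be sketched on its own: walk the granted containers with an iterator, drop the ones that are too small or sit on a blacklisted host, and collect them for release. The Grant class and the filter() method below are hypothetical simplifications in which resources are reduced to a single memory figure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "keep or release" pass over granted containers.
class AssignFilterSketch {
    static class Grant {
        final String host;
        final int memoryMb;

        Grant(String host, int memoryMb) {
            this.host = host;
            this.memoryMb = memoryMb;
        }
    }

    // Removes unusable grants from 'allocated' in place and returns them,
    // so the caller can hand them back on the next heartbeat.
    static List<Grant> filter(List<Grant> allocated, int requiredMb, Set<String> blacklist) {
        List<Grant> toRelease = new ArrayList<>();
        Iterator<Grant> it = allocated.iterator();
        while (it.hasNext()) {
            Grant g = it.next();
            boolean tooSmall = g.memoryMb < requiredMb;
            boolean badHost = blacklist.contains(g.host);
            if (tooSmall || badHost) {
                toRelease.add(g);
                it.remove(); // safe removal while iterating
            }
        }
        return toRelease;
    }

    public static void main(String[] args) {
        List<Grant> allocated = new ArrayList<>(Arrays.asList(
                new Grant("node1", 2048), new Grant("node2", 512), new Grant("node3", 2048)));
        Set<String> blacklist = new HashSet<>(Arrays.asList("node3"));
        List<Grant> released = filter(allocated, 1024, blacklist);
        System.out.println("kept " + allocated.size() + ", released " + released.size());
    }
}
```

Using Iterator.remove() is what lets the method shrink the list while walking it, which is the same reason the Hadoop excerpt above calls it.remove() rather than removing from the list directly.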

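The assignment therefore proceeds in stages. The granted containers are checked first, and any container with insufficient resources or on a blacklisted node is removed; for a container on a blacklisted node, a task matching that container is found and a new request is issued for it. The remaining containers are then handed out to the job's own tasks, and any container left unassigned is returned to the RM in the next heartbeat so that the RM can release it. The second-level allocation in assignContainers(allocatedContainers) looks like this: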

private void assignContainers(List<Container> allocatedContainers) {
  Iterator<Container> it = allocatedContainers.iterator();
  // Second-level allocation: containers are first matched by priority. Failed MapTasks are
  // served first, then Reduce tasks, and normal MapTasks last.
  // Normal MapTasks are then assigned by data locality, in this order:
  // node-local (data and container on the same node),
  // rack-local (data and container in the same rack),
  // no-local (data and container in different racks)
  while (it.hasNext()) {
    Container allocated = it.next();
    ContainerRequest assigned = assignWithoutLocality(allocated);
    if (assigned != null) {
      containerAssigned(allocated, assigned);
      it.remove();
    }
  }

  assignMapsWithLocality(allocatedContainers);
}

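So in the second-level allocation, containers are first matched by priority: failed MapTasks are served first, then Reduce tasks, and normal MapTasks last. Normal MapTasks are then assigned by data locality, in this order:

node-local (data and container on the same node)
rack-local (data and container in the same rack)
no-local (data and container in different racks)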

The detailed logic lives in assignWithoutLocality() and assignMapsWithLocality():

private ContainerRequest assignWithoutLocality(Container allocated) {
  ContainerRequest assigned = null;
  
  Priority priority = allocated.getPriority();
  if (PRIORITY_FAST_FAIL_MAP.equals(priority)) { // failed MapTasks are served first
    LOG.info("Assigning container " + allocated + " to fast fail map");
    assigned = assignToFailedMap(allocated);
  } else if (PRIORITY_REDUCE.equals(priority)) { // then Reduce tasks
    if (LOG.isDebugEnabled()) {
      LOG.debug("Assigning container " + allocated + " to reduce");
    }
    assigned = assignToReduce(allocated);
  }
  return assigned;
}

private void assignMapsWithLocality(List<Container> allocatedContainers) {
  // try to assign to all nodes first to match node local
  // Assign by data locality: take tasks from the per-host mapping (mapsHostMapping),
  // then the per-rack mapping (mapsRackMapping), then whatever remains in maps
  Iterator<Container> it = allocatedContainers.iterator();
  while(it.hasNext() && maps.size() > 0){
    // ......
    Container allocated = it.next();     
    LinkedList<TaskAttemptId> list = mapsHostMapping.get(host);
    while (list != null && list.size() > 0) {
      // ......
      containerAssigned(allocated, assigned);
    }
  }
  
  // try to match all rack local
  it = allocatedContainers.iterator();
  while(it.hasNext() && maps.size() > 0){
    // ......
    Container allocated = it.next();
    LinkedList<TaskAttemptId> list = mapsRackMapping.get(rack);
    while (list != null && list.size() > 0) {
      // ......
      containerAssigned(allocated, assigned);
    }
  }
  
  // assign remaining
  it = allocatedContainers.iterator();
  while(it.hasNext() && maps.size() > 0){
    // ......
    Container allocated = it.next();
    TaskAttemptId tId = maps.keySet().iterator().next();
    containerAssigned(allocated, assigned);
  }
}
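The three locality passes above (node-local, then rack-local, then whatever is left) can be sketched in a compact, self-contained form. This is a simplified illustration, not the Hadoop algorithm: LocalitySketch is a hypothetical class, each map task has exactly one preferred host, and the rack is derived from a made-up host naming scheme "rackX-nodeY".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of locality-ordered assignment of containers to map tasks.
class LocalitySketch {
    private static final String[] LEVELS = {"node-local", "rack-local", "no-local"};

    // Assumption for this sketch only: hosts are named "<rack>-<node>".
    static String rackOf(String host) {
        return host.substring(0, host.indexOf('-'));
    }

    static boolean matches(int pass, String dataHost, String containerHost) {
        if (pass == 0) return dataHost.equals(containerHost);
        if (pass == 1) return rackOf(dataHost).equals(rackOf(containerHost));
        return true; // last pass: any remaining task will do
    }

    // pendingMaps: task name -> host holding its input; containerHosts: hosts of granted containers.
    // Returns entries of the form "task@host:level" in the order they were assigned.
    static List<String> assign(Map<String, String> pendingMaps, List<String> containerHosts) {
        Map<String, String> pending = new LinkedHashMap<>(pendingMaps);
        List<String> free = new ArrayList<>(containerHosts);
        List<String> result = new ArrayList<>();
        for (int pass = 0; pass < LEVELS.length; pass++) {
            Iterator<String> it = free.iterator();
            while (it.hasNext() && !pending.isEmpty()) {
                String host = it.next();
                String chosen = null;
                for (Map.Entry<String, String> e : pending.entrySet()) {
                    if (matches(pass, e.getValue(), host)) {
                        chosen = e.getKey();
                        break;
                    }
                }
                if (chosen != null) {
                    pending.remove(chosen);
                    it.remove();
                    result.add(chosen + "@" + host + ":" + LEVELS[pass]);
                }
            }
        }
        return result;
    }

    static List<String> demo() {
        Map<String, String> maps = new LinkedHashMap<>();
        maps.put("m1", "r1-n1");
        maps.put("m2", "r1-n2");
        maps.put("m3", "r2-n1");
        List<String> hosts = new ArrayList<>();
        hosts.add("r1-n3");
        hosts.add("r1-n1");
        hosts.add("r3-n9");
        return assign(maps, hosts);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running the passes over the whole container list one level at a time, instead of taking the best match per container, is what guarantees that a node-local pairing is never sacrificed to an earlier container that could only offer rack locality.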

ContainerLauncher

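After the ContainerAllocator obtains a container and assigns it to a specific task via containerAssigned(), it dispatches a TaskAttemptContainerAssignedEvent to tell the TaskAttemptImpl that a container has been assigned, which triggers the following state-machine transition: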

.addTransition(TaskAttemptStateInternal.UNASSIGNED,
    TaskAttemptStateInternal.ASSIGNED, TaskAttemptEventType.TA_ASSIGNED,
    new ContainerAssignedTransition())

This runs the hook ContainerAssignedTransition. Inside it, the concrete MapTask or ReduceTask to run is created, the corresponding ContainerLaunchContext is built, and a ContainerRemoteLaunchEvent is dispatched to the ContainerLauncher.

private static class ContainerAssignedTransition implements
    SingleArcTransition<TaskAttemptImpl, TaskAttemptEvent> {
  @SuppressWarnings({ "unchecked" })
  @Override
  public void transition(final TaskAttemptImpl taskAttempt, 
      TaskAttemptEvent event) {
    final TaskAttemptContainerAssignedEvent cEvent = 
      (TaskAttemptContainerAssignedEvent) event;
    Container container = cEvent.getContainer();
    taskAttempt.container = container;
    // this is a _real_ Task (classic Hadoop mapred flavor):
    // create the MapTask or ReduceTask that will actually run
    taskAttempt.remoteTask = taskAttempt.createRemoteTask();
 
    //launch the container
    //create the container object to be launched for a given Task attempt
    // Build the ContainerLaunchContext and dispatch a ContainerRemoteLaunchEvent to the
    // ContainerLauncher, which asks the NodeManager to start the container for this task
    ContainerLaunchContext launchContext = createContainerLaunchContext(
        cEvent.getApplicationACLs(), taskAttempt.conf, taskAttempt.jobToken,
        taskAttempt.remoteTask, taskAttempt.oldJobId, taskAttempt.jvmID,
        taskAttempt.taskAttemptListener, taskAttempt.credentials);
    taskAttempt.eventHandler
      .handle(new ContainerRemoteLaunchEvent(taskAttempt.attemptId,
        launchContext, container, taskAttempt.remoteTask));

  }
}

When building the ContainerLaunchContext, the important part is the command list (cmds) that starts the container's task. createContainerLaunchContext() calls MapReduceChildJVM.getVMCommand() to build the launch command:

// Set up the launch command
List<String> commands = MapReduceChildJVM.getVMCommand(
    taskAttemptListener.getAddress(), remoteTask, jvmID);

public static List<String> getVMCommand(
    InetSocketAddress taskAttemptListenerAddr, Task task, 
    JVMId jvmID) {

  TaskAttemptID attemptID = task.getTaskID();
  JobConf conf = task.conf;

  Vector<String> vargs = new Vector<String>(8);

  vargs.add(MRApps.crossPlatformifyMREnv(task.conf, Environment.JAVA_HOME)
      + "/bin/java");

  // Add child (task) java-vm options.
  //
  // The following symbols if present in mapred.{map|reduce}.child.java.opts 
  // value are replaced:
  // + @taskid@ is interpolated with value of TaskID.
  // Other occurrences of @ will not be altered.
  //
  // Example with multiple arguments and substitutions, showing
  // jvm GC logging, and start of a passwordless JVM JMX agent so can
  // connect with jconsole and the likes to watch child memory, threads
  // and get thread dumps.
  //
  //  <property>
  //    <name>mapred.map.child.java.opts</name>
  //    <value>-Xmx 512M -verbose:gc -Xloggc:/tmp/@taskid@.gc \
  //           -Dcom.sun.management.jmxremote.authenticate=false \
  //           -Dcom.sun.management.jmxremote.ssl=false \
  //    </value>
  //  </property>
  //
  //  <property>
  //    <name>mapred.reduce.child.java.opts</name>
  //    <value>-Xmx 1024M -verbose:gc -Xloggc:/tmp/@taskid@.gc \
  //           -Dcom.sun.management.jmxremote.authenticate=false \
  //           -Dcom.sun.management.jmxremote.ssl=false \
  //    </value>
  //  </property>
  //
  String javaOpts = getChildJavaOpts(conf, task.isMapTask());
  javaOpts = javaOpts.replace("@taskid@", attemptID.toString());
  String [] javaOptsSplit = javaOpts.split(" ");
  for (int i = 0; i < javaOptsSplit.length; i++) {
    vargs.add(javaOptsSplit[i]);
  }

  Path childTmpDir = new Path(MRApps.crossPlatformifyMREnv(conf, Environment.PWD),
      YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR);
  vargs.add("-Djava.io.tmpdir=" + childTmpDir);

  // Setup the log4j prop
  long logSize = TaskLog.getTaskLogLength(conf);
  setupLog4jProperties(task, vargs, logSize, conf);

  if (conf.getProfileEnabled()) {
    if (conf.getProfileTaskRange(task.isMapTask()
                                 ).isIncluded(task.getPartition())) {
      final String profileParams = conf.get(task.isMapTask()
          ? MRJobConfig.TASK_MAP_PROFILE_PARAMS
          : MRJobConfig.TASK_REDUCE_PROFILE_PARAMS, conf.getProfileParams());
      vargs.add(String.format(profileParams,
          getTaskLogFile(TaskLog.LogName.PROFILE)));
    }
  }

  // Add main class and its arguments 
  vargs.add(YarnChild.class.getName());  // main of Child
  // pass TaskAttemptListener's address
  vargs.add(taskAttemptListenerAddr.getAddress().getHostAddress()); 
  vargs.add(Integer.toString(taskAttemptListenerAddr.getPort())); 
  vargs.add(attemptID.toString());                      // pass task identifier

  // Finally add the jvmID
  vargs.add(String.valueOf(jvmID.getId()));
  vargs.add("1>" + getTaskLogFile(TaskLog.LogName.STDOUT));
  vargs.add("2>" + getTaskLogFile(TaskLog.LogName.STDERR));

  // Final command
  StringBuilder mergedCommand = new StringBuilder();
  for (CharSequence str : vargs) {
    mergedCommand.append(str).append(" ");
  }
  Vector<String> vargsFinal = new Vector<String>(1);
  vargsFinal.add(mergedCommand.toString());
  return vargsFinal;
}

So the launch command has the form: java jvmOpts tmpdir YarnChild (the main class) args 1> xxx.log 2> xxx.log. This is the command that actually starts the container's task on the NodeManager: it launches a YarnChild process, which holds the concrete MapTask or ReduceTask and runs it inside that process.
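The shape of that command is easier to see in a stripped-down builder. The sketch below is an illustration only: ChildCommandSketch and its build() parameters are hypothetical, log4j and profiling options are omitted, and the values in main (addresses, IDs, directories) are made up. It keeps the pieces visible in getVMCommand(): the java binary, the JVM options with @taskid@ substituted, the temp directory, the main class with its four arguments, and the stdout/stderr redirections.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified builder for a child-JVM launch command.
class ChildCommandSketch {
    static String build(String javaHome, String javaOpts, String tmpDir,
                        String mainClass, String amHost, int amPort,
                        String attemptId, long jvmId, String logDir) {
        List<String> vargs = new ArrayList<>();
        vargs.add(javaHome + "/bin/java");

        // Substitute the task id placeholder, then split the options on whitespace.
        String opts = javaOpts.replace("@taskid@", attemptId).trim();
        for (String opt : opts.split("\\s+")) {
            if (!opt.isEmpty()) {
                vargs.add(opt);
            }
        }

        vargs.add("-Djava.io.tmpdir=" + tmpDir);

        // Main class followed by: listener host, listener port, attempt id, JVM id.
        vargs.add(mainClass);
        vargs.add(amHost);
        vargs.add(Integer.toString(amPort));
        vargs.add(attemptId);
        vargs.add(Long.toString(jvmId));

        // Shell redirections for the task's stdout and stderr.
        vargs.add("1>" + logDir + "/stdout");
        vargs.add("2>" + logDir + "/stderr");

        return String.join(" ", vargs);
    }

    public static void main(String[] args) {
        // All values below are illustrative.
        System.out.println(build("$JAVA_HOME", "-Xmx512m -Xloggc:/tmp/@taskid@.gc",
                "$PWD/tmp", "org.apache.hadoop.mapred.YarnChild",
                "10.0.0.1", 40000, "attempt_1_0001_m_000000_0", 7, "/logs"));
    }
}
```

Because the whole command is joined into one string, it is meant to be interpreted by a shell on the NodeManager side, which is why the redirections can be passed as ordinary arguments.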

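Back to the ContainerLauncher handling the ContainerRemoteLaunchEvent: the event is handled by a ContainerLauncherImpl instance. Its handle() method (producer-consumer pattern) simply adds the event to eventQueue, a blocking queue of ContainerLauncherEvent objects, where it waits for ContainerLauncherImpl's internal eventHandlingThread to take it and execute it in parallel on a thread pool: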

launcherPool.execute(createEventProcessor(event));
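The queue-plus-thread-pool hand-off described above can be sketched as follows. This is a hypothetical miniature, not ContainerLauncherImpl: events are plain strings, "processing" only records the event, and the pool size of 4 is arbitrary. The field names eventQueue, eventHandlingThread and launcherPool are borrowed from the text for readability.

```java
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical producer-consumer sketch: handle() enqueues, one thread dequeues,
// and a pool does the actual work in parallel.
class LauncherQueueSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final ExecutorService launcherPool = Executors.newFixedThreadPool(4);
    private final Queue<String> processed = new ConcurrentLinkedQueue<>();
    private final CountDownLatch done;
    private Thread eventHandlingThread;

    LauncherQueueSketch(int expectedEvents) {
        this.done = new CountDownLatch(expectedEvents);
    }

    void start() {
        eventHandlingThread = new Thread(() -> {
            try {
                while (true) {
                    final String event = eventQueue.take(); // blocks until an event arrives
                    launcherPool.execute(() -> {
                        processed.add(event);
                        done.countDown();
                    });
                }
            } catch (InterruptedException e) {
                // interrupted by awaitAndStop(): leave the loop
            }
        }, "event-handling-sketch");
        eventHandlingThread.start();
    }

    // Producer side: cheap and non-blocking for the caller.
    void handle(String event) {
        eventQueue.add(event);
    }

    // Waits for the expected events, shuts everything down, returns how many were processed.
    int awaitAndStop() {
        try {
            done.await(5, TimeUnit.SECONDS);
            eventHandlingThread.interrupt();
            eventHandlingThread.join();
            launcherPool.shutdown();
            launcherPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.size();
    }

    static int demo() {
        LauncherQueueSketch launcher = new LauncherQueueSketch(5);
        launcher.start();
        for (int i = 0; i < 5; i++) {
            launcher.handle("launch-" + i);
        }
        return launcher.awaitAndStop();
    }

    public static void main(String[] args) {
        System.out.println("processed events: " + demo());
    }
}
```

The point of the split is that the caller of handle() (the dispatcher thread in the real system) never blocks on a slow RPC to a NodeManager; only the pool threads do.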

The event is finally handed to the corresponding Container object, which executes CONTAINER_REMOTE_LAUNCH:

public void run() {
  // Load ContainerManager tokens before creating a connection.
  // TODO: Do it only once per NodeManager.
  ContainerId containerID = event.getContainerID();
  Container c = getContainer(event);
  switch(event.getType()) {

  // let the corresponding Container object execute the event
  case CONTAINER_REMOTE_LAUNCH:
    ContainerRemoteLaunchEvent launchEvent
        = (ContainerRemoteLaunchEvent) event;
    c.launch(launchEvent);
    break;
  case CONTAINER_REMOTE_CLEANUP:
    c.kill();
    break;
  }
  removeContainerIfDone(containerID);
}

public synchronized void launch(ContainerRemoteLaunchEvent event) {
  // build the RPC client for ContainerManagementProtocol
  ContainerManagementProtocolProxyData proxy = null;
  try {
    proxy = getCMProxy(containerMgrAddress, containerID);
    
    // Construct the actual Container
    ContainerLaunchContext containerLaunchContext =
      event.getContainerLaunchContext();

    // Now launch the actual container
    StartContainerRequest startRequest =
        StartContainerRequest.newInstance(containerLaunchContext,
          event.getContainerToken());
    List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
    list.add(startRequest);
    StartContainersRequest requestList = StartContainersRequest.newInstance(list);
    
    // RPC call to startContainers() on the corresponding NodeManager to start the container
    StartContainersResponse response =
        proxy.getContainerManagementProtocol().startContainers(requestList);
    
    // after launching, send launched event to task attempt to move
    // it from ASSIGNED to RUNNING state
    // After the launch, a TaskAttemptContainerLaunchedEvent is sent to the TaskAttemptImpl,
    // moving its state machine from ASSIGNED to RUNNING
    context.getEventHandler().handle(
        new TaskAttemptContainerLaunchedEvent(taskAttemptID, port));
    this.state = ContainerState.RUNNING;
  } catch (Throwable t) {
    // ......
  }
}

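Finally, it calls the RPC interface ContainerManagementProtocol.startContainers() to ask the corresponding NodeManager to start a container. In YARN, everything needed to run a task is packaged into the Container: required resources, external files it depends on, JARs, runtime environment variables, the launch command, and so on. The ContainerManager on the NodeManager receives the call and runs the container launch flow. That flow is covered in detail in the post on NodeManager components and functions and is not repeated here.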

Lifecycle

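A job in MRAppMaster consists of a number of Map Tasks and Reduce Tasks, and each Task in turn consists of a number of TaskAttempts. MRAppMaster represents the lifecycle of each Job, Task, and TaskAttempt with its own state machine.

During initialization, MRAppMaster constructs the JobImpl state machine that represents the state transitions of the current job: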

public class MRAppMaster extends CompositeService {
  protected void serviceStart() throws Exception {
    // ...... create the Job state machine
    // /// Create the job itself.
    job = createJob(getConfig(), forcedState, shutDownMessage);

    JobEvent initJobEvent = new JobEvent(job.getID(), JobEventType.JOB_INIT); 
    jobEventDispatcher.handle(initJobEvent);

    //start all the components
    super.serviceStart();
  }
 
  protected Job createJob(Configuration conf, JobStateInternal forcedState, String diagnostic) {
    // create single job
    Job newJob =
        new JobImpl(jobId, appAttemptID, conf, dispatcher.getEventHandler(),
            taskAttemptListener, jobTokenSecretManager, jobCredentials, clock,
            completedTasksFromPreviousRun, metrics,
            committer, newApiCommitter,
            currentUser.getUserName(), appSubmitTime, amInfos, context, 
            forcedState, diagnostic);
    ((RunningAppContext) context).jobs.put(newJob.getID(), newJob);
    dispatcher.register(JobFinishEvent.Type.class,
        createJobFinishEventHandler());     
    return newJob;
  }
}

JobImpl receives the JOB_INIT event, which moves the job from NEW to INITED and triggers InitTransition; that transition creates the Task state machines for the MapTasks and ReduceTasks:

public static class InitTransition 
      implements MultipleArcTransition<JobImpl, JobEvent, JobState> {
  @Override
  public JobState transition(JobImpl job, JobEvent event) {
    // ......
    createMapTasks(job, inputLength, taskSplitMetaInfo);
    createReduceTasks(job);
    // ......
  }
}

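When the job starts, JobImpl receives JOB_START, which triggers StartTransition. That transition dispatches CommitterEventType.JOB_SETUP, which is handled by CommitterEventHandler and finally leads to JobEventType.JOB_SETUP_COMPLETED being dispatched. This moves the JobImpl state machine from SETUP to RUNNING and triggers SetupCompletedTransition, which in turn drives the Map Task and Reduce Task state machines through their transitions: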

private static class SetupCompletedTransition
    implements SingleArcTransition<JobImpl, JobEvent> {
  @Override
  public void transition(JobImpl job, JobEvent event) {
    job.scheduleTasks(job.mapTasks, job.numReduceTasks == 0);
    job.scheduleTasks(job.reduceTasks, true);
    // ......
  }
}

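From then on, each Map Task and Reduce Task manages its own state changes. The ContainerAllocator requests resources for Map Tasks first and then for Reduce Tasks. Once a Task obtains resources, it creates a running instance, a TaskAttempt. If that instance succeeds, the Task succeeds; otherwise the Task starts another TaskAttempt, until one succeeds or the attempt limit is reached. When all Tasks have succeeded, the Job succeeds. The state changes of a successful task (excluding failure and kill cases) are as follows:

[Figure: state transitions of a successfully completed task]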
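The retry rule in the paragraph above (keep starting attempts until one succeeds or the limit is hit, and the job succeeds only when every task does) fits in a few lines. TaskRetrySketch is a hypothetical class, an attempt is modelled as a predicate on its attempt number, and the limit of 4 used in main is just an example value.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of "retry attempts until success or the attempt limit".
class TaskRetrySketch {
    // Returns the 1-based number of the attempt that succeeded, or -1 if the limit was reached.
    static int runTask(int maxAttempts, IntPredicate attemptSucceeds) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return attempt;
            }
        }
        return -1;
    }

    // A job succeeds only if every task has some successful attempt.
    static boolean runJob(int maxAttempts, IntPredicate... tasks) {
        for (IntPredicate task : tasks) {
            if (runTask(maxAttempts, task) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The first task succeeds immediately, the second only on its third attempt.
        System.out.println(runJob(4, n -> true, n -> n == 3));
        // A task that never succeeds fails the whole job.
        System.out.println(runJob(4, n -> true, n -> false));
    }
}
```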

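Note that the TaskAttempt is what triggers the actual Map or Reduce task to run in a YarnChild process. In its state machine, the transition from NEW to UNASSIGNED fires a ContainerRequestEvent, which the ContainerAllocator handles as a resource request: it classifies the request as a map or reduce resource by task type and adds it to the ask list, which is sent to the RM in the periodic heartbeat(). Once a container has been obtained and assigned to the TaskAttempt, its state moves from UNASSIGNED to ASSIGNED (this is the ContainerAllocator request-and-assign flow described above). After the assignment, the TaskAttempt delegates to the ContainerLauncher, which contacts the NodeManager over RPC and asks it to start the container, that is, the YarnChild process that runs the task. The TaskAttempt then moves from ASSIGNED to RUNNING (this is the ContainerLauncher launch flow).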
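The happy path of a TaskAttempt described above, NEW to UNASSIGNED to ASSIGNED to RUNNING, can be written as a small table-driven state machine. The sketch is hypothetical and far smaller than the Hadoop state machine: AttemptStateMachineSketch is an invented class, transitions carry no hook, and of the event names used only TA_ASSIGNED appears in the code quoted in this article; the other two are illustrative.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical table-driven state machine for the successful path of a task attempt.
class AttemptStateMachineSketch {
    enum State { NEW, UNASSIGNED, ASSIGNED, RUNNING }
    enum Event { TA_SCHEDULE, TA_ASSIGNED, TA_CONTAINER_LAUNCHED }

    // For each state, the events it accepts and the state each one leads to.
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    AttemptStateMachineSketch() {
        addTransition(State.NEW, State.UNASSIGNED, Event.TA_SCHEDULE);
        addTransition(State.UNASSIGNED, State.ASSIGNED, Event.TA_ASSIGNED);
        addTransition(State.ASSIGNED, State.RUNNING, Event.TA_CONTAINER_LAUNCHED);
    }

    private void addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    // Applies the event, or rejects it if the current state has no such arc.
    State handle(Event event) {
        Map<Event, State> arcs = table.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    static State demo() {
        AttemptStateMachineSketch sm = new AttemptStateMachineSketch();
        sm.handle(Event.TA_SCHEDULE);           // container requested
        sm.handle(Event.TA_ASSIGNED);           // container assigned by the allocator
        return sm.handle(Event.TA_CONTAINER_LAUNCHED); // container started by the launcher
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Declaring the arcs up front, as the addTransition(...) excerpt earlier in this article does, means an event arriving in the wrong state is rejected in one place instead of being checked by scattered if statements.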