Overview
RMAppAttempt state machine
Figure 1-1
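The happy path of the state machine in Figure 1-1, as walked through in the sections below, can be summarized as a transition table. This is a minimal sketch, not YARN's StateMachineFactory-based implementation. The class name AttemptHappyPath is hypothetical, and only the transitions discussed in this article are modeled. Ending directly in FINISHED on UNREGISTERED is a simplification, because real releases may pass through intermediate states before finishing.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of the RMAppAttempt happy path in this article.
// Not YARN's StateMachineFactory; only the transitions discussed here are listed.
class AttemptHappyPath {
    enum State { NEW, SUBMITTED, SCHEDULED, ALLOCATED_SAVING, ALLOCATED, LAUNCHED, RUNNING, FINISHED }

    enum Event { START, APP_ACCEPTED, CONTAINER_ALLOCATED, ATTEMPT_SAVED, LAUNCHED,
                 REGISTERED, CONTAINER_ACQUIRED, UNREGISTERED }

    // (current state, event) -> next state; unlisted pairs are invalid.
    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    private static void add(State from, Event on, State to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
    }

    static {
        add(State.NEW, Event.START, State.SUBMITTED);
        add(State.SUBMITTED, Event.APP_ACCEPTED, State.SCHEDULED);
        add(State.SCHEDULED, Event.CONTAINER_ALLOCATED, State.ALLOCATED_SAVING);
        add(State.ALLOCATED_SAVING, Event.ATTEMPT_SAVED, State.ALLOCATED);
        add(State.ALLOCATED, Event.LAUNCHED, State.LAUNCHED);
        add(State.LAUNCHED, Event.REGISTERED, State.RUNNING);
        add(State.RUNNING, Event.CONTAINER_ACQUIRED, State.RUNNING); // self-loop
        add(State.RUNNING, Event.UNREGISTERED, State.FINISHED);      // simplified ending
    }

    static State next(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        State to = (row == null) ? null : row.get(event);
        if (to == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        return to;
    }

    public static void main(String[] args) {
        State s = State.NEW;
        for (Event e : Event.values()) { // events are declared in happy-path order
            s = next(s, e);
            System.out.println(e + " -> " + s);
        }
    }
}
```

Running main prints each event with the state it leads to. Any (state, event) pair missing from the table throws, loosely mirroring how YARN rejects an invalid transition.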
APP_ACCEPTED Handle
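RMAppAttempt is created and started by RMApp. After it submits its request to the scheduler, it enters the SUBMITTED state. The scheduler validates the request, creates an internal application object, submits that object to a queue to wait for scheduling, and then sends an APP_ACCEPTED event to the dispatcher, which eventually delivers it to RMAppAttempt. Using CapacityScheduler as the example: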
FiCaSchedulerApp SchedulerApp =
    new FiCaSchedulerApp(applicationAttemptId, user, queue,
        queue.getActiveUsersManager(), rmContext);
// Submit to the queue
try {
  queue.submitApplication(SchedulerApp, user, queueName);
} catch (AccessControlException ace) {
  LOG.info("Failed to submit application " + applicationAttemptId +
      " to queue " + queueName + " from user " + user, ace);
  this.rmContext.getDispatcher().getEventHandler().handle(
      new RMAppAttemptRejectedEvent(applicationAttemptId,
          ace.toString()));
  return;
}
applications.put(applicationAttemptId, SchedulerApp);
LOG.info("Application Submission: " + applicationAttemptId +
    ", user: " + user +
    " queue: " + queue +
    ", currently active: " + applications.size());
rmContext.getDispatcher().getEventHandler().handle(
    new RMAppAttemptEvent(applicationAttemptId,
        RMAppAttemptEventType.APP_ACCEPTED));
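On receiving this event, the state machine invokes ScheduleTransition. If the ApplicationMaster is managed (launched by the RM rather than by the client), the transition registers the attempt with the scheduler's waiting queue by requesting a container for the AM, and the state machine then enters the SCHEDULED state.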
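The branch inside ScheduleTransition can be sketched as follows. The sketch assumes, based on the Hadoop 2.x sources, that a managed AM causes a single AM container request to be handed to the scheduler, and that an unmanaged AM skips the request and moves to LAUNCHED_UNMANAGED_SAVING. The class, the AmRequest type, and the pendingAmRequests list are hypothetical stand-ins for the scheduler's real ResourceRequest bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the decision ScheduleTransition makes on APP_ACCEPTED.
// AmRequest and pendingAmRequests stand in for real ResourceRequest handling.
class ScheduleTransitionSketch {
    enum State { SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

    // One request for a single container that will host the ApplicationMaster.
    static final class AmRequest {
        final String attemptId;
        final int memoryMb;

        AmRequest(String attemptId, int memoryMb) {
            this.attemptId = attemptId;
            this.memoryMb = memoryMb;
        }
    }

    // Requests the scheduler will try to satisfy on later NodeManager heartbeats.
    final List<AmRequest> pendingAmRequests = new ArrayList<>();

    State onAppAccepted(String attemptId, boolean unmanagedAm, int amMemoryMb) {
        if (!unmanagedAm) {
            // Managed AM: queue a request for the AM container, then wait in
            // SCHEDULED until CONTAINER_ALLOCATED arrives.
            pendingAmRequests.add(new AmRequest(attemptId, amMemoryMb));
            return State.SCHEDULED;
        }
        // Unmanaged AM: the client starts the AM itself, so nothing is requested.
        return State.LAUNCHED_UNMANAGED_SAVING;
    }

    public static void main(String[] args) {
        ScheduleTransitionSketch t = new ScheduleTransitionSketch();
        System.out.println(t.onAppAccepted("attempt_1", false, 1024)); // SCHEDULED
        System.out.println(t.onAppAccepted("attempt_2", true, 1024));  // LAUNCHED_UNMANAGED_SAVING
        System.out.println(t.pendingAmRequests.size());                // 1
    }
}
```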
CONTAINER_ALLOCATED Handle
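Once the state machine is in SCHEDULED, the system waits for the next heartbeat from a NodeManager (NM). When the heartbeat arrives, the scheduler checks the node's currently available capacity and, if there is enough, allocates a container object for the application on that node: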
In LeafQueue
// Create the container if necessary
Container container =
    getContainer(rmContainer, application, node, capability, priority);
// something went wrong getting/creating the container
if (container == null) {
  LOG.warn("Couldn't get container for allocation!");
  return Resources.none();
}
// Can we allocate a container on this node?
int availableContainers =
    resourceCalculator.computeAvailableContainers(available, capability);
if (availableContainers > 0) {
  // Allocate...
  // Did we previously reserve containers at this 'priority'?
  if (rmContainer != null) {
    unreserve(application, priority, node, rmContainer);
  }
  // Create container tokens in secure-mode
  if (UserGroupInformation.isSecurityEnabled()) {
    ContainerToken containerToken =
        createContainerToken(application, container);
    if (containerToken == null) {
      // Something went wrong...
      return Resources.none();
    }
    container.setContainerToken(containerToken);
  }
  // Inform the application
  RMContainer allocatedContainer =
      application.allocate(type, node, priority, request, container);
  // Does the application need this resource?
  if (allocatedContainer == null) {
    return Resources.none();
  }
  // Inform the node
  node.allocateContainer(application.getApplicationId(),
      allocatedContainer);
  // ... (remainder of the method omitted in this excerpt)
}
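Whether an allocation happens at all hinges on computeAvailableContainers(available, capability) returning a positive number. A minimal sketch of that arithmetic follows, using plain ints instead of YARN's Resource objects. The memory-only variant assumes behavior like YARN's DefaultResourceCalculator (integer division of available memory by required memory). The two-resource variant, which takes the minimum over memory and vcores in the spirit of a dominant-resource check, is a hypothetical illustration only.

```java
// Hypothetical sketch of the capacity check used in the LeafQueue excerpt.
// Plain ints stand in for YARN's Resource objects; required values must be > 0.
class CapacityCheckSketch {
    // Memory-only check: how many containers of requiredMb fit into availableMb.
    static int availableContainersByMemory(int availableMb, int requiredMb) {
        return availableMb / requiredMb; // integer division rounds down
    }

    // Two-resource check: the scarcer resource limits the count.
    static int availableContainers(int availableMb, int availableVcores,
                                   int requiredMb, int requiredVcores) {
        return Math.min(availableMb / requiredMb, availableVcores / requiredVcores);
    }

    public static void main(String[] args) {
        System.out.println(availableContainersByMemory(3000, 1024)); // 2
        System.out.println(availableContainersByMemory(512, 1024));  // 0
        System.out.println(availableContainers(8192, 1, 1024, 1));   // 1
    }
}
```

With 3000 MB free and 1024 MB requested, integer division gives 2, so the node can host the container; with only 512 MB free the result is 0 and the `availableContainers > 0` branch is skipped.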
The first container is used to run the ApplicationMaster. Once it has been allocated, the AppAttempt fetches the allocated container from the Scheduler and sets it as the master container:
// Acquire the AM container from the scheduler.
Allocation amContainerAllocation = appAttempt.scheduler.allocate(
    appAttempt.applicationAttemptId, EMPTY_CONTAINER_REQUEST_LIST,
    EMPTY_CONTAINER_RELEASE_LIST);
// Set the masterContainer
appAttempt.setMasterContainer(
    amContainerAllocation.getContainers().get(0));
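Calling allocate() with empty request and release lists works as a pure fetch: it returns containers the scheduler has already assigned to the attempt. A minimal sketch of that pull model follows, with hypothetical names (PullAllocationSketch, recordAllocation) and string IDs standing in for Container objects. The ask and release parameters are accepted but ignored, and clearing the buffer on each call is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the pull-style allocate() call: the scheduler buffers
// containers assigned to an attempt, and allocate() hands them over.
class PullAllocationSketch {
    private final Map<String, List<String>> newlyAllocated = new HashMap<>();

    // Called when a node heartbeat produces an allocation for attemptId.
    void recordAllocation(String attemptId, String containerId) {
        newlyAllocated.computeIfAbsent(attemptId, k -> new ArrayList<>()).add(containerId);
    }

    // ask/release are ignored here; with empty lists the call only drains the buffer.
    List<String> allocate(String attemptId, List<String> ask, List<String> release) {
        List<String> out = newlyAllocated.remove(attemptId);
        return (out == null) ? Collections.<String>emptyList() : out;
    }

    public static void main(String[] args) {
        PullAllocationSketch scheduler = new PullAllocationSketch();
        scheduler.recordAllocation("attempt_1", "container_1");
        List<String> got = scheduler.allocate("attempt_1",
                Collections.<String>emptyList(), Collections.<String>emptyList());
        String masterContainer = got.get(0); // first container hosts the AM
        System.out.println(masterContainer); // container_1
    }
}
```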
The AppAttempt then asks the state store to save the current application state and enters the ALLOCATED_SAVING state. When saving completes, the AppAttempt receives an ATTEMPT_SAVED event.
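Saving before launching keeps the state machine from blocking on storage I/O: the attempt parks in ALLOCATED_SAVING and reacts only when ATTEMPT_SAVED arrives. The sketch below models that asynchronous store-then-notify pattern with CompletableFuture. The in-memory map, the string events, and all names are hypothetical and do not reflect RMStateStore's actual API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical sketch: persist attempt state asynchronously, then post an
// ATTEMPT_SAVED event back to the state machine once the write completes.
class StoreThenNotifySketch {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final Consumer<String> eventHandler;

    StoreThenNotifySketch(Consumer<String> eventHandler) {
        this.eventHandler = eventHandler;
    }

    CompletableFuture<Void> storeAttempt(String attemptId, String masterContainerId) {
        return CompletableFuture
                .runAsync(() -> store.put(attemptId, masterContainerId)) // the "write"
                .thenRun(() -> eventHandler.accept("ATTEMPT_SAVED:" + attemptId));
    }

    String saved(String attemptId) {
        return store.get(attemptId);
    }

    public static void main(String[] args) {
        List<String> events = new CopyOnWriteArrayList<>();
        StoreThenNotifySketch s = new StoreThenNotifySketch(events::add);
        s.storeAttempt("attempt_1", "container_1").join(); // join() only for the demo
        System.out.println(events); // [ATTEMPT_SAVED:attempt_1]
    }
}
```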
ATTEMPT_SAVED Handle
On receiving this event, the state machine launches the AM container so that the ApplicationMaster can start running:
// In RMAppAttemptImpl
private void launchAttempt() {
  // Send event to launch the AM Container
  eventHandler.handle(new AMLauncherEvent(AMLauncherEventType.LAUNCH, this));
}

// In AMLauncher
private void launch() throws IOException {
  connect();
  ContainerId masterContainerID = masterContainer.getId();
  ApplicationSubmissionContext applicationContext =
      application.getSubmissionContext();
  LOG.info("Setting up container " + masterContainer
      + " for AM " + application.getAppAttemptId());
  ContainerLaunchContext launchContext =
      createAMContainerLaunchContext(applicationContext, masterContainerID);
  StartContainerRequest request =
      recordFactory.newRecordInstance(StartContainerRequest.class);
  request.setContainerLaunchContext(launchContext);
  request.setContainer(masterContainer);
  containerMgrProxy.startContainer(request);
  LOG.info("Done launching container " + masterContainer
      + " for AM " + application.getAppAttemptId());
}
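The launch itself is an RPC to the NodeManager, so it runs off the dispatcher thread and its outcome is turned into an event for the attempt. A minimal sketch follows, assuming success maps to LAUNCHED and an exception maps to a LAUNCH_FAILED event. ContainerStarter and launchTask are hypothetical names, and the lambdas stand in for the startContainer call in the excerpt.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: run the container-start call on a worker thread and
// translate its outcome into an event for the attempt.
class LauncherSketch {
    // Stand-in for the RPC that asks a NodeManager to start a container.
    interface ContainerStarter {
        void start(String containerId) throws Exception;
    }

    static Runnable launchTask(String containerId, ContainerStarter starter,
                               Consumer<String> events) {
        return () -> {
            try {
                starter.start(containerId);
                events.accept("LAUNCHED");
            } catch (Exception e) {
                events.accept("LAUNCH_FAILED: " + e.getMessage());
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> events = new CopyOnWriteArrayList<>(); // written by the worker thread
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(launchTask("container_1", id -> { }, events::add));
        pool.execute(launchTask("container_2",
                id -> { throw new Exception("NM unreachable"); }, events::add));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(events); // [LAUNCHED, LAUNCH_FAILED: NM unreachable]
    }
}
```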
Once the launch succeeds, the attempt receives a LAUNCHED event:
LAUNCHED Handle
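On receiving LAUNCHED, the AppAttempt registers itself with the liveness-monitoring thread and then waits for the ApplicationMaster to start. Once started, the ApplicationMaster must register itself with the ResourceManager, which forwards this registration event to the AppAttempt for handling.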
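The liveness monitor (AMLivelinessMonitor in the Hadoop sources) tracks the time of each AM's last heartbeat and expires attempts that go silent. The sketch below captures that idea with an explicit clock so it is easy to test. The class and method names and the timeout value are hypothetical, and a real monitor runs the expiry check periodically on its own thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an AM liveness monitor with an explicit clock.
class LivenessMonitorSketch {
    private final Map<String, Long> lastPing = new ConcurrentHashMap<>();
    private final long expireMillis;

    LivenessMonitorSketch(long expireMillis) {
        this.expireMillis = expireMillis;
    }

    // Called when the attempt reaches LAUNCHED.
    void register(String attemptId, long nowMillis) {
        lastPing.put(attemptId, nowMillis);
    }

    // Called on every AM heartbeat; ignored for attempts that are not registered.
    void receivedPing(String attemptId, long nowMillis) {
        lastPing.replace(attemptId, nowMillis);
    }

    // Called when the attempt finishes normally.
    void unregister(String attemptId) {
        lastPing.remove(attemptId);
    }

    // Returns and forgets attempts silent for longer than expireMillis.
    // A real monitor would run this periodically on its own thread.
    List<String> collectExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        lastPing.forEach((id, ts) -> {
            if (nowMillis - ts > expireMillis) {
                expired.add(id);
            }
        });
        expired.forEach(lastPing::remove);
        return expired;
    }

    public static void main(String[] args) {
        LivenessMonitorSketch monitor = new LivenessMonitorSketch(1000);
        monitor.register("attempt_1", 0);
        monitor.register("attempt_2", 0);
        monitor.receivedPing("attempt_1", 900);
        System.out.println(monitor.collectExpired(1500)); // [attempt_2]
    }
}
```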
REGISTERED Handle
On receiving the REGISTERED event, the AppAttempt records the ApplicationMaster's runtime information (host, port, tracking URL) and then notifies the App:
// Let the app know
appAttempt.eventHandler.handle(new RMAppEvent(appAttempt
    .getAppAttemptId().getApplicationId(),
    RMAppEventType.ATTEMPT_REGISTERED));
After registering, the AM keeps sending heartbeats by calling ApplicationMasterService.allocate(). For each heartbeat, the Scheduler first sends an ACQUIRED event to the RMContainers already allocated to the AM so that their state is updated; each RMContainer, once its state is updated, sends a CONTAINER_ACQUIRED event to notify the RMAppAttempt.
CONTAINER_ACQUIRED Handle
When the RMAppAttempt receives this event, it adds the node hosting the container to its ranNodes set:
appAttempt.ranNodes.add(acquiredEvent.getContainer().getNodeId());
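The acquire path and the ranNodes bookkeeping can be sketched as below. Because ranNodes is a set, several containers on the same node record that node only once. The Container class, onHeartbeat, and the node names are hypothetical simplifications; in YARN the attempt learns about each container through a CONTAINER_ACQUIRED event rather than a direct method call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: each container handed to the AM in a heartbeat is
// "acquired", and the attempt records the node it runs on.
class AcquireSketch {
    static final class Container {
        final String id;
        final String nodeId;

        Container(String id, String nodeId) {
            this.id = id;
            this.nodeId = nodeId;
        }
    }

    // Mirrors appAttempt.ranNodes: a Set, so several containers on one node count once.
    final Set<String> ranNodes = new HashSet<>();

    void onContainerAcquired(Container c) {
        ranNodes.add(c.nodeId);
    }

    // Stand-in for the heartbeat path: every newly handed-out container is acquired.
    void onHeartbeat(List<Container> newlyAllocated) {
        for (Container c : newlyAllocated) {
            onContainerAcquired(c);
        }
    }

    public static void main(String[] args) {
        AcquireSketch attempt = new AcquireSketch();
        attempt.onHeartbeat(Arrays.asList(
                new Container("c1", "node1"),
                new Container("c2", "node1"),
                new Container("c3", "node2")));
        System.out.println(attempt.ranNodes.size()); // 2
    }
}
```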
UNREGISTERED Handle
When the job finishes, the AM unregisters itself with ApplicationMasterService. The AppAttempt then receives an UNREGISTERED event, performs a series of cleanup steps, and finally exits.