Notes on the YARN Source Code (Part 6)

土豆钊

Article link: https://blog.csdn.net/u013384984/article/details/80336029

At last I can get back to writing about the overall flow of submitting and running the ApplicationMaster.

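When my previous analysis reached RMAppAttemptImpl, I realized that I did not understand the scheduler and the state machine well enough, so I paused to work through the concepts I needed first. Today we continue with the overall flow of ApplicationMaster submission and execution: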

In the previous post we saw that RMAppAttemptImpl moves from RMAppAttemptState.SCHEDULED to RMAppAttemptState.ALLOCATED_SAVING when the RMAppAttemptEventType.CONTAINER_ALLOCATED event fires, and that the transition is implemented by AMContainerAllocatedTransition. Let's look at its code:

// Set the masterContainer
			appAttempt.setMasterContainer(amContainerAllocation.getContainers()
					.get(0));
			RMContainerImpl rmMasterContainer = (RMContainerImpl) appAttempt.scheduler
					.getRMContainer(appAttempt.getMasterContainer().getId());
			rmMasterContainer.setAMContainer(true);
			// The node set in NMTokenSecretManager is used for marking whether the
			// NMToken has been issued for this node to the AM.
			// When AM container was allocated to RM itself, the node which allocates
			// this AM container was marked as the NMToken already sent. Thus,
			// clear this node set so that the following allocate requests from AM
			// are able to retrieve the corresponding NMToken.
			appAttempt.rmContext.getNMTokenSecretManager()
					.clearNodeSetForAttempt(appAttempt.applicationAttemptId);
			appAttempt.getSubmissionContext().setResource(
					appAttempt.getMasterContainer().getResource());
			appAttempt.storeAttempt();
			return RMAppAttemptState.ALLOCATED_SAVING;

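As we can see, for the attempt instance we submitted, this code records the master container, then takes the attempt's submissionContext (its submission context) and assigns it the master container's resource. The storeAttempt method it calls is: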

private void storeAttempt() {
		// store attempt data in a non-blocking manner to prevent dispatcher
		// thread starvation and wait for state to be saved
		LOG.info("Storing attempt: AppId: "
				+ getAppAttemptId().getApplicationId() + " AttemptId: "
				+ getAppAttemptId() + " MasterContainer: " + masterContainer);
		rmContext.getStateStore().storeNewApplicationAttempt(this);
	}

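Here the RMAppAttemptImpl is saved into the global state store. When the RMStateStore is initialized, if the corresponding recovery setting is false (that is, previous applications will not be recovered), the store defaults to a NullRMStateStore.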
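To make the store selection concrete, here is a minimal sketch in plain Java, not the ResourceManager's actual code. The names StateStore, NullStateStore, PersistentStateStore and createStore are hypothetical, and I am assuming the setting in question is the yarn.resourcemanager.recovery.enabled property, read here from a java.util.Properties object as a stand-in for the real configuration class:

```java
import java.util.Properties;

// Hypothetical sketch: choose a no-op store unless recovery is enabled.
class StateStoreSelectionSketch {
    interface StateStore {
        boolean persists();
    }

    // Stand-in for a store that accepts writes but keeps nothing.
    static class NullStateStore implements StateStore {
        public boolean persists() { return false; }
    }

    // Stand-in for any store that really saves state.
    static class PersistentStateStore implements StateStore {
        public boolean persists() { return true; }
    }

    static StateStore createStore(Properties conf) {
        // Assumed property name; a missing value is treated as "false".
        boolean recoveryEnabled = Boolean.parseBoolean(
                conf.getProperty("yarn.resourcemanager.recovery.enabled", "false"));
        return recoveryEnabled ? new PersistentStateStore() : new NullStateStore();
    }
}
```

With an empty Properties object the sketch returns the no-op store, which matches the default behaviour described above.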

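Moving on: note that the state of RMAppAttemptImpl has now been updated to RMAppAttemptState.ALLOCATED_SAVING. The next step of the state machine has to be driven by some event, but three transitions start from RMAppAttemptState.ALLOCATED_SAVING. Which one will fire?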

addTransition(RMAppAttemptState.ALLOCATED_SAVING,
					RMAppAttemptState.ALLOCATED,
					RMAppAttemptEventType.ATTEMPT_NEW_SAVED,
					new AttemptStoredTransition())

			// App could be killed by the client. So need to handle this.
			.addTransition(
					RMAppAttemptState.ALLOCATED_SAVING,
					RMAppAttemptState.FINAL_SAVING,
					RMAppAttemptEventType.KILL,
					new FinalSavingTransition(new BaseFinalTransition(
							RMAppAttemptState.KILLED), RMAppAttemptState.KILLED))
			.addTransition(
					RMAppAttemptState.ALLOCATED_SAVING,
					RMAppAttemptState.FINAL_SAVING,
					RMAppAttemptEventType.CONTAINER_FINISHED,
					new FinalSavingTransition(
							new AMContainerCrashedBeforeRunningTransition(),
							RMAppAttemptState.FAILED))
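The question "which of the three fires" comes down to a lookup keyed by the current state and the incoming event type. The following is a minimal sketch of that idea only; it is not Hadoop's StateMachineFactory, the class name AttemptStateMachineSketch is hypothetical, and the transition hooks (such as AttemptStoredTransition) are left out:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a (state, event type) -> next state table.
class AttemptStateMachineSketch {
    enum State { ALLOCATED_SAVING, ALLOCATED, FINAL_SAVING }
    enum EventType { ATTEMPT_NEW_SAVED, KILL, CONTAINER_FINISHED }

    // Transition table: current state -> (event type -> next state).
    private final Map<State, Map<EventType, State>> table = new HashMap<>();
    private State current = State.ALLOCATED_SAVING;

    AttemptStateMachineSketch() {
        // The same three arcs as in the excerpt above.
        addTransition(State.ALLOCATED_SAVING, State.ALLOCATED, EventType.ATTEMPT_NEW_SAVED);
        addTransition(State.ALLOCATED_SAVING, State.FINAL_SAVING, EventType.KILL);
        addTransition(State.ALLOCATED_SAVING, State.FINAL_SAVING, EventType.CONTAINER_FINISHED);
    }

    private void addTransition(State from, State to, EventType on) {
        table.computeIfAbsent(from, k -> new HashMap<>()).put(on, to);
    }

    // Finds the arc for (current state, event type) and follows it.
    State handle(EventType type) {
        Map<EventType, State> arcs = table.get(current);
        if (arcs == null || !arcs.containsKey(type)) {
            throw new IllegalStateException("No transition from " + current + " on " + type);
        }
        current = arcs.get(type);
        return current;
    }
}
```

So the arc that fires is decided entirely by the type of the next event delivered to the attempt, which is why the next step is to find out what that event is.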

Clearly, we need to find the type of the event that fires next. That event is in fact raised from inside the store operation:

public synchronized void storeNewApplicationAttempt(RMAppAttempt appAttempt) {
		Credentials credentials = getCredentialsFromAppAttempt(appAttempt);

		AggregateAppResourceUsage resUsage = appAttempt
				.getRMAppAttemptMetrics().getAggregateAppResourceUsage();
		ApplicationAttemptState attemptState = new ApplicationAttemptState(
				appAttempt.getAppAttemptId(), appAttempt.getMasterContainer(),
				credentials, appAttempt.getStartTime(),
				resUsage.getMemorySeconds(), resUsage.getVcoreSeconds());

		dispatcher.getEventHandler().handle(
				new RMStateStoreAppAttemptEvent(attemptState));
	}

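Look closely at this method: the dispatcher used here is RMStateStore's own internal dispatcher, not the RM's global one. The class is unusual in that it holds two dispatchers: rmDispatcher is the global dispatcher, and dispatcher is its private one: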

dispatcher.register(RMStateStoreEventType.class,
				new ForwardingEventHandler());

So the event is handed to ForwardingEventHandler. Let's look at how it handles it:

@Override
		public void handle(RMStateStoreEvent event) {
			handleStoreEvent(event);
		}

// Dispatcher related code
	protected void handleStoreEvent(RMStateStoreEvent event) {
		try {
			this.stateMachine.doTransition(event.getType(), event);
		} catch (InvalidStateTransitonException e) {
			LOG.error("Can't handle this event at current state", e);
		}
	}

Next we need to look at the type of the event itself, RMStateStoreAppAttemptEvent:

public class RMStateStoreAppAttemptEvent extends RMStateStoreEvent {
	ApplicationAttemptState attemptState;

	public RMStateStoreAppAttemptEvent(ApplicationAttemptState attemptState) {
		super(RMStateStoreEventType.STORE_APP_ATTEMPT);
		this.attemptState = attemptState;
	}
	public ApplicationAttemptState getAppAttemptState() {
		return attemptState;
	}
}

For RMStateStore, this means an event of type STORE_APP_ATTEMPT has fired:

addTransition(RMStateStoreState.DEFAULT,
					RMStateStoreState.DEFAULT,
					RMStateStoreEventType.STORE_APP_ATTEMPT,
					new StoreAppAttemptTransition())

Let's look at StoreAppAttemptTransition:

ApplicationAttemptState attemptState = ((RMStateStoreAppAttemptEvent) event)
					.getAppAttemptState();
			try {
				ApplicationAttemptStateData attemptStateData = ApplicationAttemptStateData
						.newInstance(attemptState);
				if (LOG.isDebugEnabled()) {
					LOG.debug("Storing info for attempt: "
							+ attemptState.getAttemptId());
				}
				store.storeApplicationAttemptStateInternal(
						attemptState.getAttemptId(), attemptStateData);
				store.notifyApplicationAttempt(new RMAppAttemptEvent(
						attemptState.getAttemptId(),
						RMAppAttemptEventType.ATTEMPT_NEW_SAVED));
			}

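Only the key code is excerpted here (the catch block is omitted). It first takes the attempt state out of the submitted event, stores it, and then notifies the relevant dispatcher: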

private void notifyApplicationAttempt(RMAppAttemptEvent event) {
		rmDispatcher.getEventHandler().handle(event);
	}
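The hand-off between the two dispatchers can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names (TwoDispatcherSketch, Dispatcher, StoreEvent, AttemptEvent, StateStore); in it, handle runs the handler immediately instead of queueing the event, so it shows only the routing, not the threading:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: a store with a private dispatcher that reports
// back through a shared, global dispatcher.
class TwoDispatcherSketch {
    // Routes an event to the handler registered for its class (synchronously).
    static class Dispatcher {
        private final Map<Class<?>, Consumer<Object>> handlers = new HashMap<>();

        <T> void register(Class<T> type, Consumer<T> handler) {
            handlers.put(type, e -> handler.accept(type.cast(e)));
        }

        void handle(Object event) {
            Consumer<Object> h = handlers.get(event.getClass());
            if (h == null) {
                throw new IllegalArgumentException("No handler for " + event.getClass());
            }
            h.accept(event);
        }
    }

    static class StoreEvent {
        final String attemptId;
        StoreEvent(String attemptId) { this.attemptId = attemptId; }
    }

    static class AttemptEvent {
        final String attemptId;
        final String type;
        AttemptEvent(String attemptId, String type) {
            this.attemptId = attemptId;
            this.type = type;
        }
    }

    static class StateStore {
        private final Dispatcher dispatcher = new Dispatcher(); // private to the store
        private final Dispatcher rmDispatcher;                  // global, shared
        final List<String> saved = new ArrayList<>();

        StateStore(Dispatcher rmDispatcher) {
            this.rmDispatcher = rmDispatcher;
            dispatcher.register(StoreEvent.class, this::handleStoreEvent);
        }

        void storeNewApplicationAttempt(String attemptId) {
            dispatcher.handle(new StoreEvent(attemptId));
        }

        private void handleStoreEvent(StoreEvent e) {
            saved.add(e.attemptId); // stand-in for the real write
            // Report completion on the global dispatcher, not the private one.
            rmDispatcher.handle(new AttemptEvent(e.attemptId, "ATTEMPT_NEW_SAVED"));
        }
    }
}
```

The point of the split is that store work is routed on the store's own dispatcher, and only the completion notice travels back through the dispatcher that the rest of the RM listens on.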

notifyApplicationAttempt uses the global dispatcher: it puts the event on rmDispatcher's queue. Let's see who handles it:

rmDispatcher.register(RMAppAttemptEventType.class,
					new ApplicationAttemptEventDispatcher(rmContext));

The handle method of ApplicationAttemptEventDispatcher:

public void handle(RMAppAttemptEvent event) {
			ApplicationAttemptId appAttemptID = event.getApplicationAttemptId();
			ApplicationId appAttemptId = appAttemptID.getApplicationId();
			RMApp rmApp = this.rmContext.getRMApps().get(appAttemptId);
			if (rmApp != null) {
				RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptID);
				if (rmAppAttempt != null) {
					try {
						rmAppAttempt.handle(event);
					} catch (Throwable t) {
						LOG.error(
								"Error in handling event type "
										+ event.getType()
										+ " for applicationAttempt "
										+ appAttemptId, t);
					}
				}
			}
		}
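The routing in this handler is a two-level lookup (application first, then attempt) with a guard so that a failing attempt handler cannot take the dispatcher down. A minimal sketch, with hypothetical names (AttemptRoutingSketch, registerAttempt, route) and plain strings standing in for the ID and event types:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of app -> attempt -> handler routing.
class AttemptRoutingSketch {
    // appId -> (attemptId -> handler for event types)
    private final Map<String, Map<String, Consumer<String>>> apps = new HashMap<>();

    void registerAttempt(String appId, String attemptId, Consumer<String> handler) {
        apps.computeIfAbsent(appId, k -> new HashMap<>()).put(attemptId, handler);
    }

    // Returns true if a handler was found and invoked, false if the
    // application or the attempt is unknown (the event is then dropped).
    boolean route(String appId, String attemptId, String eventType) {
        Map<String, Consumer<String>> attempts = apps.get(appId);
        if (attempts == null) {
            return false;
        }
        Consumer<String> handler = attempts.get(attemptId);
        if (handler == null) {
            return false;
        }
        try {
            handler.accept(eventType);
        } catch (RuntimeException e) {
            // Swallow and report, so one bad attempt does not stop routing.
            System.err.println("Error handling " + eventType + " for " + attemptId + ": " + e);
        }
        return true;
    }
}
```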

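Now look at the handle method in RMAppAttemptImpl. What state is the attempt in when this event arrives? In theory one might expect it to still be RMAppAttemptState.SCHEDULED, because the last step of the previous transition, its return statement, would not yet have completed: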

appAttempt.storeAttempt();
return RMAppAttemptState.ALLOCATED_SAVING;

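In practice that is not the case. storeAttempt dispatches asynchronously, so by the time the event comes back, the state of our RMAppAttemptImpl (that is, of its state machine) has already become ALLOCATED_SAVING, and the event type that comes back is: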

RMAppAttemptEventType.ATTEMPT_NEW_SAVED
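The ordering argument can be checked with a small model. The sketch below is an assumption-laden simplification: it collapses everything into a single event queue drained by one thread, uses plain strings for states and event types, and the class name AsyncOrderingSketch is hypothetical. It only illustrates why an event that is queued during a transition is handled after that transition has returned:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single-queue model of asynchronous event handling.
class AsyncOrderingSketch {
    private final Queue<String> eventQueue = new ArrayDeque<>();
    private String state = "SCHEDULED";
    // Records "<state when handled> + <event type>" for each event.
    final List<String> log = new ArrayList<>();

    void post(String eventType) {
        eventQueue.add(eventType);
    }

    // Drains the queue in order, the way a single dispatcher thread would.
    void drain() {
        String type;
        while ((type = eventQueue.poll()) != null) {
            handle(type);
        }
    }

    private void handle(String type) {
        log.add(state + " + " + type);
        if (state.equals("SCHEDULED") && type.equals("CONTAINER_ALLOCATED")) {
            // The "store" step only enqueues the follow-up event.
            post("ATTEMPT_NEW_SAVED");
            // The transition returns and the new state is recorded
            // before the queued event is looked at.
            state = "ALLOCATED_SAVING";
        } else if (state.equals("ALLOCATED_SAVING") && type.equals("ATTEMPT_NEW_SAVED")) {
            state = "ALLOCATED";
        }
    }

    String getState() {
        return state;
    }
}
```

In this model the second event is always observed in ALLOCATED_SAVING, never in SCHEDULED, because it cannot be taken off the queue until the first handler has finished.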

Good. Now let's see which transition ATTEMPT_NEW_SAVED triggers:

.addTransition(RMAppAttemptState.ALLOCATED_SAVING,
					RMAppAttemptState.ALLOCATED,
					RMAppAttemptEventType.ATTEMPT_NEW_SAVED,
					new AttemptStoredTransition())

This is the transition that fires. Let's look at it:

private static final class AttemptStoredTransition extends BaseTransition {
		@Override
		public void transition(RMAppAttemptImpl appAttempt,
				RMAppAttemptEvent event) {
			appAttempt.launchAttempt();
		}
	}

private void launchAttempt() {
		// Send event to launch the AM Container
		eventHandler.handle(new AMLauncherEvent(AMLauncherEventType.LAUNCH,
				this));
	}

This eventHandler belongs to the global dispatcher, and events of type AMLauncherEvent are handled by ApplicationMasterLauncher:

@Override
	public synchronized void handle(AMLauncherEvent appEvent) {
		AMLauncherEventType event = appEvent.getType();
		RMAppAttempt application = appEvent.getAppAttempt();
		switch (event) {
		case LAUNCH:
			launch(application);
			break;
		case CLEANUP:
			cleanup(application);
		default:
			break;
		}
	}

In our case the submitted event type is LAUNCH:

private void launch(RMAppAttempt application) {
		Runnable launcher = createRunnableLauncher(application,
				AMLauncherEventType.LAUNCH);
		masterEvents.add(launcher);
	}

Once this service has started, it has a thread doing the work:

private class LauncherThread extends Thread {
		public LauncherThread() {
			super("ApplicationMaster Launcher");
		}
		@Override
		public void run() {
			while (!this.isInterrupted()) {
				Runnable toLaunch;
				try {
					toLaunch = masterEvents.take();
					launcherPool.execute(toLaunch);
				} catch (InterruptedException e) {
					LOG.warn(this.getClass().getName()
							+ " interrupted. Returning.");
					return;
				}
			}
		}
	}
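The pattern here is a blocking queue feeding a thread pool: one thread does nothing but take tasks off the queue and hand them to the pool. A minimal standalone sketch of that pattern follows; the class name LauncherQueueSketch, the pool size of 2 and the start/submit/stop methods are my own assumptions, not the real service's API:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a queue of Runnables drained by one thread
// and executed on a pool.
class LauncherQueueSketch {
    private final BlockingQueue<Runnable> masterEvents = new LinkedBlockingQueue<>();
    private final ExecutorService launcherPool = Executors.newFixedThreadPool(2);

    private final Thread launcherThread = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Blocks until a task is available.
                Runnable toLaunch = masterEvents.take();
                launcherPool.execute(toLaunch);
            } catch (InterruptedException e) {
                // Interrupted while waiting: stop draining.
                return;
            }
        }
    }, "launcher-sketch");

    void start() {
        launcherThread.start();
    }

    // Producers only enqueue; they never run the task themselves.
    void submit(Runnable task) {
        masterEvents.add(task);
    }

    void stop() throws InterruptedException {
        launcherThread.interrupt();
        launcherThread.join();
        // Lets tasks already handed to the pool finish, then releases it.
        launcherPool.shutdown();
    }
}
```

One benefit of this arrangement is that the thread calling submit returns immediately, so a slow launch does not hold up whoever produced the event.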

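ApplicationMasterLauncher itself extends the abstract service class. When the RM starts, its serviceInit and serviceStart methods are invoked, which is what starts this internal thread. What matters to us is the run method of the Runnable we submitted, that is, the run method of AMLauncher: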

case LAUNCH:
			try {
				LOG.info("Launching master" + application.getAppAttemptId());
				launch();
				handler.handle(new RMAppAttemptEvent(application
						.getAppAttemptId(), RMAppAttemptEventType.LAUNCHED));
			} catch (Exception ie) {
				String message = "Error launching "
						+ application.getAppAttemptId() + ". Got exception: "
						+ StringUtils.stringifyException(ie);
				LOG.info(message);
				handler.handle(new RMAppAttemptLaunchFailedEvent(application
						.getAppAttemptId(), message));
			}
			break;

Note that the handler here also belongs to the global dispatcher. Let's go through the calls in the try block one by one, starting with launch:

private void launch() throws IOException, YarnException {
		connect();
		ContainerId masterContainerID = masterContainer.getId();
		ApplicationSubmissionContext applicationContext = application
				.getSubmissionContext();
		LOG.info("Setting up container " + masterContainer + " for AM "
				+ application.getAppAttemptId());
		ContainerLaunchContext launchContext = createAMContainerLaunchContext(
				applicationContext, masterContainerID);

		StartContainerRequest scRequest = StartContainerRequest.newInstance(
				launchContext, masterContainer.getContainerToken());
		List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
		list.add(scRequest);
		StartContainersRequest allRequests = StartContainersRequest
				.newInstance(list);

		StartContainersResponse response = containerMgrProxy
				.startContainers(allRequests);
		if (response.getFailedRequests() != null
				&& response.getFailedRequests().containsKey(masterContainerID)) {
			Throwable t = response.getFailedRequests().get(masterContainerID)
					.deSerialize();
			parseAndThrowException(t);
		} else {
			LOG.info("Done launching container " + masterContainer + " for AM "
					+ application.getAppAttemptId());
		}
	}

There seems to be a lot going on in there, so let's take it apart:

connect:

private void connect() throws IOException {
		ContainerId masterContainerID = masterContainer.getId();
		containerMgrProxy = getContainerMgrProxy(masterContainerID);
	}

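connect looks up the containerId we need to reach, establishes an RPC connection to the NodeManager hosting that container, and keeps the connection open. The next step: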

ContainerLaunchContext launchContext = createAMContainerLaunchContext(
				applicationContext, masterContainerID);

This call assembles a complete launch context. Then:

StartContainerRequest scRequest = StartContainerRequest.newInstance(
				launchContext, masterContainer.getContainerToken());

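This creates a request. As we can see, it contains the launch context and the token of the master container to be started; the token is required in order to start the corresponding Container legitimately. Then: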

List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
		list.add(scRequest);

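This piece looks odd: there is only one element, so why build a list to send it? My guess is that it is there for future extension, since it may be necessary to start programs on several containers at once. Finally: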

StartContainersResponse response = containerMgrProxy
				.startContainers(allRequests);

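This is the key step. The RPC connection we kept open earlier now comes into play, and the method's name is exciting too: after all the groundwork, it is finally time to start the Container.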
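The shape of this call (a list of requests going out, a map of per-container failures coming back) can be sketched in isolation. Everything below is a hypothetical stand-in: BatchStartSketch, StartRequest and StartResponse are not the YARN types, the "remote side" is a local method, and rejecting an empty token is just an invented failure rule for the demonstration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a batch start call with per-item failures.
class BatchStartSketch {
    static class StartRequest {
        final String containerId;
        final String token;
        StartRequest(String containerId, String token) {
            this.containerId = containerId;
            this.token = token;
        }
    }

    static class StartResponse {
        final List<String> succeeded = new ArrayList<>();
        final Map<String, String> failed = new HashMap<>(); // containerId -> reason
    }

    // Stand-in for the remote side: rejects requests that carry no token.
    static StartResponse startContainers(List<StartRequest> requests) {
        StartResponse response = new StartResponse();
        for (StartRequest request : requests) {
            if (request.token == null || request.token.isEmpty()) {
                response.failed.put(request.containerId, "missing token");
            } else {
                response.succeeded.add(request.containerId);
            }
        }
        return response;
    }

    // Caller side: wrap the single request in a list, then check whether
    // our own container shows up among the failures.
    static void launch(String containerId, String token) {
        List<StartRequest> list = new ArrayList<>();
        list.add(new StartRequest(containerId, token));
        StartResponse response = startContainers(list);
        if (response.failed.containsKey(containerId)) {
            throw new IllegalStateException(
                    "Failed to start " + containerId + ": " + response.failed.get(containerId));
        }
    }
}
```

A batch interface like this has to report failures per item, because one call can partly succeed; that is why the caller looks up its own container ID in the failure map rather than relying on an exception from the call itself.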

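Here we can also see that, by default, the Container is started with 1 GB of memory and 1 core, which is why we sometimes adjust the Container's launch parameters to suit our needs.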

Since this is an RPC request, who actually executes the method?

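The answer: the NM that corresponds to the ContainerId is located, and a Container is started on it. Starting a Container is a fairly involved process, so I will cover it in the next post.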
